Towards Data Science – Medium
Your home for data science. A publication sharing concepts, ideas and codes.

  • Explaining LLMs for RAG and Summarization
    by Daniel Klitzke on November 21, 2024 at 3:31 pm

    A fast and low-resource method using similarity-based attribution

Information flow between an input document and its summary as computed by the proposed explainability method. (image created by author)

TL;DR

Explaining LLMs is very slow and resource-intensive.
This article proposes a task-specific explanation technique for RAG Q&A and summarization.
The approach is model-agnostic and similarity-based.
The approach is low-resource and low-latency, so it can run almost anywhere.
I provided the code on GitHub, using the Hugging Face Transformers ecosystem.

Motivation

There are a lot of good reasons to get explanations for your model outputs. For example, they could help you find problems with your model, or they could simply provide more transparency to the user, thereby facilitating user trust. This is why, for models like XGBoost, I have regularly applied methods like SHAP to get more insight into my models' behavior.

Now, dealing more and more with LLM-based ML systems, I wanted to explore ways of explaining LLM models the same way I did with more traditional ML approaches. However, I quickly found myself stuck because:

SHAP does offer examples for text-based models, but for me they failed with newer models, as SHAP did not support the embedding layers.
Captum also offers a tutorial for LLM attribution; however, both presented methods had their own very specific drawbacks. Concretely, the perturbation-based method was simply too slow, while the gradient-based method made my GPU memory explode and ultimately failed.

After playing with quantization and even spinning up GPU cloud instances with still limited success, I had enough and took a step back.

A Similarity-based Approach

To understand the approach, let's first briefly define what we want to achieve. Concretely, we want to identify and highlight sections in our input text (e.g., a long text document or RAG context) that are highly relevant to our model output (e.g., a summary or RAG answer).

Typical flow of tasks our explainability method is applicable to. (image created by author)

In the case of summarization, our method would have to highlight parts of the original input text that are highly reflected in the summary. In the case of a RAG system, our approach would have to highlight document chunks from the RAG context that show up in the answer.

Since directly explaining the LLM itself has proven intractable for me, I instead propose to model the relation between model inputs and outputs via a separate text similarity model. Concretely, I implemented the following simple but effective approach:

1. I split the model inputs and outputs into sentences.
2. I calculate pairwise similarities between all sentences.
3. I then normalize the similarity scores using softmax.
4. After that, I visualize the similarities between input and output sentences in a plot.

In code, this is implemented as shown below. For running the code you need the Hugging Face Transformers, Sentence Transformers, and NLTK libraries. Please also check out this GitHub repository for the full code accompanying this blog post.
from sentence_transformers import SentenceTransformer
from nltk.tokenize import sent_tokenize
import numpy as np

# Original text truncated for brevity …
text = """This section briefly summarizes the state of the art in the area of semantic segmentation and semantic instance segmentation. As the majority of state-of-the-art techniques in this area are deep learning approaches we will focus on this area.

Early deep learning-based approaches that aim at assigning semantic classes to the pixels of an image are based on patch classification. Here the image is decomposed into superpixels in a preprocessing step e.g. by applying the SLIC algorithm [1].

Other approaches are based on so-called Fully Convolutional Neural Networks (FCNs). Here not an image patch but the whole image are taken as input and the output is a two-dimensional feature map that assigns class probabilities to each pixel. Conceptually FCNs are similar to CNNs used for classification but the fully connected layers are usually replaced by transposed convolutions which have learnable parameters and can learn to upsample the extracted features to the final pixel-wise classification result. …"""

# Define a concise summary that captures the key points
summary = "Semantic segmentation has evolved from early patch-based classification approaches using superpixels to more advanced Fully Convolutional Networks (FCNs) that process entire images and output pixel-wise classifications."

# Load the embedding model
model = SentenceTransformer('BAAI/bge-small-en')

# Split texts into sentences
input_sentences = sent_tokenize(text)
summary_sentences = sent_tokenize(summary)

# Calculate embeddings for all sentences
input_embeddings = model.encode(input_sentences)
summary_embeddings = model.encode(summary_sentences)

# Calculate similarity matrix using cosine similarity
similarity_matrix = np.zeros((len(summary_sentences), len(input_sentences)))
for i, sum_emb in enumerate(summary_embeddings):
    for j, inp_emb in enumerate(input_embeddings):
        similarity = np.dot(sum_emb, inp_emb) / (np.linalg.norm(sum_emb) * np.linalg.norm(inp_emb))
        similarity_matrix[i, j] = similarity

# Calculate final attribution scores (mean aggregation)
final_scores = np.mean(similarity_matrix, axis=0)

# Create and print attribution dictionary
attributions = {
    sentence: float(score)
    for sentence, score in zip(input_sentences, final_scores)
}
print("\nInput sentences and their attribution scores:")
for sentence, score in attributions.items():
    print(f"\nScore {score:.3f}: {sentence}")

So, as you can see, so far this is pretty simple. Obviously, we don't explain the model itself. However, we might be able to get a good sense of the relations between input and output sentences for this specific type of task (summarization / RAG Q&A). But how does this actually perform, and how can we visualize the attribution results to make sense of the output?
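The snippet above aggregates the raw cosine similarities with a simple mean. Step 3 of the method mentions softmax normalization; as a rough sketch of what that step could look like (reusing similarity_matrix and numpy from the code above; this is an illustration, not necessarily the exact code in the repository):

# Hypothetical sketch: softmax-normalize each summary sentence's similarities
# over the input sentences before aggregating (not necessarily the repository code).
def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

# Each row now sums to 1, i.e. a distribution over input sentences
normalized_matrix = softmax(similarity_matrix, axis=1)

# Aggregate per input sentence as before (mean over summary sentences)
normalized_scores = np.mean(normalized_matrix, axis=0)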
Evaluation for RAG and Summarization

To visualize the outputs of this approach, I created two visualizations that are suitable for showing the feature attributions and the connections between the input and output of the LLM, respectively.

These visualizations were generated for a summary of the LLM input that goes as follows:

This section discusses the state of the art in semantic segmentation and instance segmentation, focusing on deep learning approaches. Early patch classification methods use superpixels, while more recent fully convolutional networks (FCNs) predict class probabilities for each pixel. FCNs are similar to CNNs but use transposed convolutions for upsampling. Standard architectures include U-Net and VGG-based FCNs, which are optimized for computational efficiency and feature size. For instance segmentation, proposal-based and instance embedding-based techniques are reviewed, including the use of proposals for instance segmentation and the concept of instance embeddings.

Visualizing the Feature Attributions

For visualizing the feature attributions, my choice was to stick as closely as possible to the original representation of the input data.

Visualization of sentence-wise feature attribution scores based on color mapping. (image created by author)

Concretely, I simply plot the sentences, including their calculated attribution scores, and map the attribution scores to the colors of the respective sentences.

In this case, this shows us some dominant patterns in the summarization and the source sentences that the information might stem from. Concretely, the dominance of mentions of FCNs as an architecture variant mentioned in the text, as well as the mention of proposal- and instance embedding-based instance segmentation methods, are clearly highlighted.

In general, this method turned out to work well for easily capturing attributions on the input of a summarization task, as it stays very close to the original representation and adds very little clutter to the data. I could imagine also providing such a visualization to the user of a RAG system on demand. Potentially, the outputs could also be further processed to threshold to certain especially relevant chunks; these could then be displayed to the user by default to highlight relevant sources.

Again, check out the GitHub repository for the actual visualization code; a simplified sketch of the idea follows below.
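As an illustration only (not the author's implementation), a minimal color-mapping sketch with matplotlib could reuse input_sentences and final_scores from the earlier snippet; the layout here is deliberately simplified:

# Hypothetical sketch: map each attribution score to a color and draw the
# sentences line by line. Not the repository's plotting code.
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

norm = Normalize(vmin=float(final_scores.min()), vmax=float(final_scores.max()))
cmap = plt.get_cmap("Blues")

fig, ax = plt.subplots(figsize=(10, len(input_sentences) * 0.6))
ax.axis("off")
for idx, (sentence, score) in enumerate(zip(input_sentences, final_scores)):
    ax.text(
        0.01, 1.0 - idx / len(input_sentences),  # simple top-to-bottom layout
        f"{score:.2f}  {sentence}",
        transform=ax.transAxes, va="top", fontsize=9, wrap=True,
        bbox=dict(facecolor=cmap(norm(score)), alpha=0.6, pad=2),
    )
plt.show()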
Visualizing the Information Flow

Another visualization technique focuses not on the feature attributions, but on the flow of information between the input text and the summary.

Visualization of the information flow between sentences in input text and summary as a Sankey diagram. (image created by author)

Concretely, what I do here is first determine the major connections between input and output sentences based on the attribution scores. I then visualize those connections using a Sankey diagram. Here, the width of the flow connections is the strength of the connection, and the coloring is done based on the sentences in the summary for better traceability.

Here, it shows that the summary mostly follows the order of the text. However, there are a few parts where the LLM might have combined information from the beginning and the end of the text; e.g., the summary mentions a focus on deep learning approaches in the first sentence. This is taken from the last sentence of the input text and is clearly shown in the flow chart.

In general, I found this to be useful, especially to get a sense of how much the LLM is aggregating information from different parts of the input, rather than just copying or rephrasing certain parts. In my opinion, this can also help estimate how much potential for error there is if an output relies too much on the LLM for making connections between different bits of information.

Possible Extensions and Adaptations

In the code provided on GitHub I implemented certain extensions of the basic approach shown in the previous sections. Concretely, I explored the following (a rough sketch of the first two follows below):

Use of different aggregations, such as max, for the similarity score. This can make sense as the mean similarity to output sentences is not necessarily what matters; already one good hit could be relevant for our explanation.
Use of different window sizes, e.g., taking chunks of three sentences to compute similarities. This again makes sense if you suspect that one sentence alone does not carry enough content to truly capture the relatedness of two passages, so a larger context is created.
Use of cross-encoder-based models, such as rerankers. This could be useful as rerankers explicitly model the relatedness of two input documents in one model, being far more sensitive to nuanced language in the two documents. See also my recent post on Towards Data Science.

As said, all of this is demoed in the provided code, so make sure to check that out as well.
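As a minimal sketch of the first two extensions (reusing similarity_matrix, input_sentences, summary_embeddings, and model from the earlier snippet; the repository implements these ideas in full, so this is only an approximation):

# Hypothetical sketch of the extensions, not the exact repository code.

# 1) Max aggregation: one strong match is enough to mark an input sentence as relevant
max_scores = np.max(similarity_matrix, axis=0)

# 2) Windowing: score chunks of several consecutive input sentences instead of single ones
window_size = 3
windows = [
    " ".join(input_sentences[i:i + window_size])
    for i in range(0, len(input_sentences), window_size)
]
window_embeddings = model.encode(windows)
window_similarities = np.array([
    [
        np.dot(s, w) / (np.linalg.norm(s) * np.linalg.norm(w))
        for w in window_embeddings
    ]
    for s in summary_embeddings
])
window_scores = np.max(window_similarities, axis=0)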
Conclusion

In general, I found it pretty challenging to find tutorials that truly demonstrate explainability techniques for non-toy scenarios in RAG and summarization. Especially techniques that are useful in "real-time" scenarios and thus provide low latency seemed to be scarce. However, as shown in this post, simple solutions can already give quite nice results when it comes to showing relations between documents and answers in a RAG use case. I will definitely explore this further and see how I can use this in RAG production scenarios, as providing traceable outputs to the users has proven invaluable to me. If you are interested in the topic and want to get more content in this style, follow me here on Medium and on LinkedIn.

Explaining LLMs for RAG and Summarization was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Getting Started with Multimodal AI, One-Hot Encoding, and Other Beginner-Friendly Guides
    by TDS Editors on November 21, 2024 at 3:14 pm

    Getting Started with Multimodal AI, CPUs and GPUs, One-Hot Encoding, and Other Beginner-Friendly Guides

Feeling inspired to write your first TDS post? We're always open to contributions from new authors.

Taking the first step towards mastering a new topic is always a bit daunting—sometimes it's even very daunting! It doesn't matter if you're learning about algorithms for the first time, dipping your toes into the exciting world of LLMs, or have just been tasked with revamping your team's data stack: taking on a challenge with little or no prior experience requires nontrivial amounts of courage and grit.

The calm and nuanced perspective of more seasoned practitioners can go a long way, too — which is where our authors excel. This week, we've gathered some of our standout recent contributions that are tailored specifically to the needs of early-stage learners attempting to expand their skill set. Let's roll up our sleeves and get started!

From Parallel Computing Principles to Programming for CPU and GPU Architectures
For freshly minted data scientists and ML engineers, few areas are more crucial to understand than memory fundamentals and parallel execution. Shreya Shukla's thorough and accessible guide is the perfect resource to get a firm footing in this topic, focusing on how to write code for both CPU and GPU architectures to accomplish fundamental tasks like vector-matrix multiplication.

Multimodal Models — LLMs That Can See and Hear
If you're feeling confident in your knowledge of LLM basics, why not take the next step and explore multimodal models, which can take in (and in some cases, generate) multiple forms of data—from images to code and audio? Shaw Talebi's primer, the first part of a new series, offers a solid foundation from which to build your practical know-how.

Boosting Algorithms in Machine Learning, Part II: Gradient Boosting
Whether you've only recently started your ML journey or have been at it for so long that a refresher might be useful, it's never a bad idea to firm up your knowledge of the basics. Gurjinder Kaur's ongoing exploration of boosting algorithms is a great case in point, presenting accessible, easy-to-digest breakdowns of some of the most powerful models out there—in this case, gradient boosting.

Photo by Taria Camerino on Unsplash

NLP Illustrated, Part 1: Text Encoding
Another new project we're thrilled to share with our readers? Shreya Rao's just-launched series of illustrated guides to core concepts in natural language processing, the very technology powering many of the fancy chatbots and AI apps that have made a splash in recent years. Part one zooms in on an essential step in just about any NLP workflow: turning textual data into numerical inputs via text encoding.

Decoding One-Hot Encoding: A Beginner's Guide to Categorical Data
If you're looking to learn about another form of data transformation, don't miss Vyacheslav Efimov's clear and concise introduction to one-hot encoding, "one of the most fundamental techniques used for data preprocessing," turning categorical features into numerical vectors.

Excel Spreadsheets Are Dead for Big Data. Companies Need More Python Instead.
One type of transition that is often even more difficult than learning a new topic is switching to a new tool or workflow—especially when the one you're moving away from fits squarely within your comfort zone. As Ari Joury, PhD explains, however, sometimes a temporary sacrifice of speed and ease of use is worth it, as in the case of adopting Python-based data tools instead of Excel spreadsheets.

Ready to venture out into other topics and challenges this week? We hope so—we've published some excellent articles recently on LLM apps, Python-generated art, AI ethics, and more:

After building LLM-based applications this past year, Satwiki De shares practical insights on how the process diverges from traditional product-development norms.
In his latest article, Robert Lange focuses on recent advances in neural-network training, and examines various methods of distributed training, such as data-parallel training and gossip-based averaging.
Translating data analysis into valuable business decisions remains a perennial challenge for data professionals. Tessa Xie presents a fresh perspective on this problem—as well as several pragmatic recommendations.
Anyone in the mood for a math deep dive should head right over to Reza Bagheri's latest explainer, which walks us through the inner workings of the all-important softmax function.
Having been disappointed by the outputs of generative-AI tools, Anna Gordun Peiro attempts to create Mondrian-inspired artwork using nothing but Python, and documents her process in an easy-to-follow tutorial.
When you work with time series data, it's essential to know whether your outlier treatment has been effective. Sara Nóbrega devotes her latest post to a detailed discussion of the various approaches you can use to evaluate the treatment's impact.
What does it take to create AI ethics and governance frameworks that function at scale? Jason Tamara Widjaja unpacks the challenges of bridging common organizational and implementation gaps.
Writing at the intersection of music and AI, Jon Flynn walks us through some of the recent developments in this growing field, and zooms in on the Qwen2-Audio model, which is trained to transcribe musical inputs into sheet music.

Thank you for supporting the work of our authors! As we mentioned above, we love publishing articles from new authors, so if you've recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don't hesitate to share it with us.

Until the next Variable,
TDS Team

Getting Started with Multimodal AI, One-Hot Encoding, and Other Beginner-Friendly Guides was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • How to Connect LlamaIndex with Private LLM API Deployments
    by Peng Qian on November 21, 2024 at 1:03 pm

    When your enterprise doesn't use public models like OpenAI. Continue reading on Towards Data Science »

  • How to Easily Deploy a Local Generative Search Engine Using VerifAI
    by Nikola Milosevic (Data Warrior) on November 21, 2024 at 12:02 pm

    An open-source initiative to help you deploy generative search based on your local files and self-hosted (Mistral, Llama 3.x) or commercial LLM models (GPT-4, GPT-4o, etc.)

I have previously written about building your own simple generative search, as well as about the VerifAI project, on Towards Data Science. However, there has been a major update worth revisiting. Initially, VerifAI was developed as a biomedical generative search with referenced and AI-verified answers. This version is still available, and we now call it VerifAI BioMed. It can be accessed here: https://app.verifai-project.com/.

The major update, however, is that you can now index your local files and turn them into your own generative search engine (or productivity engine, as some refer to these systems based on GenAI). It can also serve as an enterprise or organizational generative search. We call this version VerifAI Core, as it serves as the foundation for the other version. In this article, we will explore how you can deploy it and start using it in a few simple steps. Given that it has been written in Python, it can be run on any kind of operating system.

Architecture

The best way to describe a generative search engine is by breaking it down into three parts (or components, in our case):

Indexing
Retrieval-Augmented Generation (RAG) method
Verification — VerifAI contains this additional engine on top of the usual generative search capabilities

Indexing in VerifAI can be done by pointing its indexer script to a local folder containing files such as PDF, MS Word, PowerPoint, Text, or Markdown (.md). The script reads and indexes these files. Indexing is performed in dual mode, utilizing both lexical and semantic indexing.

For lexical indexing, VerifAI uses OpenSearch. For semantic indexing, it vectorizes chunks of the documents using an embedding model specified in the configuration file (models from Hugging Face are supported) and then stores these vectors in Qdrant. A visual representation of this process is shown in the diagram below.

Architecture of indexing (diagram by author)

When it comes to answering questions using VerifAI, the method is somewhat complex. User questions, written in natural language, undergo preprocessing (e.g., stopwords are excluded) and are then transformed into queries.

For OpenSearch, only lexical processing is performed (e.g., excluding stopwords), and the most relevant documents are retrieved. For Qdrant, the query is transformed into embeddings using the same model that was used to embed document chunks when they were stored in Qdrant. These embeddings are then used to query Qdrant, retrieving the most similar documents based on dot product similarity. The dot product is employed because it accounts for both the angle and the magnitude of the vectors.

Finally, the results from the two engines must be merged. This is done by normalizing the retrieval scores from each engine to values between 0 and 1 (achieved by dividing each score by the highest score from its respective engine). Scores corresponding to the same document are then added together and sorted by their combined score in descending order, as sketched below.
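To make that merging step concrete, here is a minimal sketch, assuming each engine returns a mapping from document IDs to raw scores; the function and variable names are illustrative and not taken from the VerifAI codebase:

# Hypothetical sketch of the described score merging, not the actual VerifAI code.
def merge_results(lexical_scores: dict, semantic_scores: dict) -> list:
    """Normalize each engine's scores by its maximum, sum per document, sort descending."""
    merged = {}
    for scores in (lexical_scores, semantic_scores):
        if not scores:
            continue
        top = max(scores.values())
        for doc_id, score in scores.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + score / top
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Example usage with made-up scores
lexical = {"doc_a": 12.3, "doc_b": 7.1}
semantic = {"doc_a": 0.83, "doc_c": 0.79}
print(merge_results(lexical, semantic))  # doc_a ranks first, since it scores high in both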
Using the retrieved documents, a prompt is built. The prompt contains instructions, the top documents, and the user's question. This prompt is then passed to the large language model of choice (which can be specified in the configuration file, or, if no model is set, defaults to our locally deployed fine-tuned version of Mistral). Finally, a verification model is applied to ensure there are no hallucinations, and the answer is presented to the user through the GUI. The schematic of this process is shown in the image below.

Architecture of retrieval, generation, and verification (image by author). The model is based on the combination of the following papers: https://arxiv.org/pdf/2407.11485, https://aclanthology.org/2024.bionlp-1.44/

Installing the necessary libraries

To install VerifAI Generative Search, you can start by cloning the latest codebase from GitHub or using one of the available releases:

git clone https://github.com/nikolamilosevic86/verifAI.git

When installing VerifAI Search, it is recommended to start by creating a clean Python environment. I have tested it with Python 3.6, but it should work with most Python 3 versions. However, Python 3.10+ may encounter compatibility issues with certain dependencies.

To create a Python environment, you can use the venv library as follows:

python -m venv verifai
source verifai/bin/activate

After activating the environment, you can install the required libraries. The requirements file is located in the verifAI/backend directory. You can run the following command to install all the dependencies:

pip install -r requirements.txt

Configuring the system

The next step is configuring VerifAI and its interactions with other tools. This can be done either by setting environment variables directly or by using an environment file (the preferred option).

An example of an environment file for VerifAI is provided in the backend folder as .env.local.example. You can rename this file to .env, and the VerifAI backend will automatically read it. The file structure is as follows:

SECRET_KEY=6293db7b3f4f67439ad61d1b798242b035ee36c4113bf870
ALGORITHM=HS256

DBNAME=verifai_database
USER_DB=myuser
PASSWORD_DB=mypassword
HOST_DB=localhost

OPENSEARCH_IP=localhost
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=admin
OPENSEARCH_PORT=9200
OPENSEARCH_USE_SSL=False

QDRANT_IP=localhost
QDRANT_PORT=6333
QDRANT_API=8da7625d78141e19a9bf3d878f4cb333fedb56eed9097904b46ce4c33e1ce085
QDRANT_USE_SSL=False

OPENAI_PATH=<model-deployment-path>
OPENAI_KEY=<model-deployment-key>
OPENAI_DEPLOYMENT_NAME=<name-of-model-deployment>
MAX_CONTEXT_LENGTH=128000

USE_VERIFICATION = True

EMBEDDING_MODEL="sentence-transformers/msmarco-bert-base-dot-v5"

INDEX_NAME_LEXICAL = 'myindex-lexical'
INDEX_NAME_SEMANTIC = "myindex-semantic"

Some of the variables are quite straightforward. The first two, the secret key and the algorithm, are used for communication between the frontend and the backend.

Then there are variables configuring access to the PostgreSQL database. It needs the database name (DBNAME), username, password, and the host address where the database is located. In our case, it is on localhost, on the Docker image.

The next section is the configuration of OpenSearch access: the IP (localhost in our case again), username, password, port number (the default port is 9200), and a variable defining whether to use SSL.

A similar configuration section exists for Qdrant, except that for Qdrant we use an API key, which has to be defined here.

The next section defines the generative model. VerifAI uses the OpenAI Python library, which has become the industry standard, and this allows it to use the OpenAI API, the Azure API, and user deployments via vLLM, Ollama, or NVIDIA NIMs. The user needs to define the path to the interface, the API key, and the name of the model deployment that will be used. We are soon adding support so that users can modify or change the prompt that is used for generation.
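To illustrate how the OPENAI_PATH, OPENAI_KEY, and OPENAI_DEPLOYMENT_NAME variables typically map onto an OpenAI-compatible client (for example against a self-hosted vLLM or Ollama endpoint), here is a rough sketch of the general pattern; this is not VerifAI's internal code:

# Hypothetical sketch: using the OpenAI Python client against a self-hosted,
# OpenAI-compatible endpoint configured via the environment variables above.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_PATH"],   # e.g. http://localhost:8000/v1 for vLLM
    api_key=os.environ["OPENAI_KEY"],
)

response = client.chat.completions.create(
    model=os.environ["OPENAI_DEPLOYMENT_NAME"],
    messages=[{"role": "user", "content": "Answer factually, with references: ..."}],
)
print(response.choices[0].message.content)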
If no path to an interface and no key are provided, the model will download the Mistral 7B model, with the QLoRA adapter that we have fine-tuned, and deploy it locally. However, if you do not have enough GPU RAM, or RAM in general, this may fail or work terribly slowly.

You can also set MAX_CONTEXT_LENGTH; in this case it is set to 128,000 tokens, as that is the context size of GPT-4o. The context length variable is used to build the context. Generally, the context is built by putting in an instruction about answering the question factually, with references, and then providing the retrieved relevant documents and the question. However, documents can be large and exceed the context length. If this happens, the documents are split into chunks, and the top n chunks that fit into the context size are used for the context, roughly as sketched below.
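A minimal sketch of that kind of greedy context packing might look as follows; the count_tokens helper is a placeholder (a real implementation would use the model's tokenizer), and this is not the exact VerifAI code:

# Hypothetical sketch of greedy context packing, not the actual VerifAI implementation.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude placeholder; in practice use the model's tokenizer

def build_context(instruction: str, question: str, ranked_chunks: list, max_tokens: int) -> str:
    """Add the highest-ranked chunks until the token budget is exhausted."""
    budget = max_tokens - count_tokens(instruction) - count_tokens(question)
    selected = []
    for chunk in ranked_chunks:  # assumed sorted by retrieval score, best first
        cost = count_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return "\n\n".join([instruction, *selected, question])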
The next part contains the Hugging Face name of the model that is used for embedding documents in Qdrant. Finally, there are the names of the indexes in both OpenSearch (INDEX_NAME_LEXICAL) and Qdrant (INDEX_NAME_SEMANTIC).

As we said previously, VerifAI has a component that verifies whether the generated claim is based on the provided and referenced document. However, this can be turned on or off, as for some use cases this functionality is not needed. One can turn it off by setting USE_VERIFICATION to False.

Installing datastores

The final step of the installation is to run the install_datastores.py file. Before running this file, you need to install Docker and ensure that the Docker daemon is running. As this file reads the configuration for setting up the usernames, passwords, and API keys for the tools it is installing, it is necessary to create the configuration file first, as explained in the previous section.

This script sets up the necessary components, including OpenSearch, Qdrant, and PostgreSQL, and creates a database in PostgreSQL:

python install_datastores.py

Note that this script installs Qdrant and OpenSearch without SSL certificates, and the following instructions assume SSL is not required. If you need SSL for a production environment, you will need to configure it manually.

Also, note that we are talking about a local installation on Docker here. If you already have Qdrant and OpenSearch deployed, you can simply update the configuration file to point to those instances.

Indexing files

The configuration is used by both the indexing method and the backend service. Therefore, it must be completed before indexing. Once the configuration is set up, you can run the indexing process by pointing index_files.py to the folder containing the files to be indexed:

python index_files.py <path-to-directory-with-files>

We have included a folder called test_data in the repository, which contains several test files (primarily my papers and other past writings). You can replace these files with your own and run the following:

python index_files.py test_data

This will run indexing over all files in that folder and its subfolders. Once finished, you can run the VerifAI backend and frontend services.

Running the generative search

The backend of VerifAI can be run simply with:

python main.py

This starts the FastAPI service that acts as the backend and passes requests to OpenSearch and Qdrant to retrieve relevant files for the given queries, to the LLM deployment for generating answers, and to the local model for claim verification.

The frontend lives in a folder called client-gui/verifai-ui and is written in React.js; it therefore requires a local installation of Node.js and npm. You can then simply install the dependencies by running npm install and start the frontend by running npm start:

cd ..
cd client-gui/verifai-ui
npm install
npm start

Finally, things should look something like this:

One of the example questions, with verification turned on (note the text in green) and a reference to the file, which can be downloaded (screenshot by author)

Screenshot showcasing the tooltip of the verified claim, with the most similar sentence from the article presented (screenshot by author)

Contributing and future direction

So far, VerifAI has been started with the help of funding from the Next Generation Internet Search project as a subgrant of the European Union. It was started as a collaboration between the Institute for Artificial Intelligence Research and Development of Serbia and Bayer A.G. The first version was developed as a generative search engine for biomedicine. This product will continue to run at https://app.verifai-project.com/. However, we recently decided to expand the project so that it can truly become an open-source generative search with verifiable answers for any files, which can be leveraged openly by different enterprises, small and medium companies, non-governmental organizations, or governments. These modifications have been developed by Natasa Radmilovic and me voluntarily (huge shout-out to Natasa!).

However, given that this is an open-source project, available on GitHub (https://github.com/nikolamilosevic86/verifAI), we welcome contributions from anyone, via pull requests, bug reports, feature requests, discussions, or anything else you can contribute with (feel free to get in touch — for both the BioMed and Core (document generative search, as described here) versions, the website will remain the same: https://verifai-project.com). So we welcome you to contribute, star our project, and follow us in the future.

How to Easily Deploy a Local Generative Search Engine Using VerifAI was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Rewiring My Career: How I Transitioned from Electrical Engineering to Data Engineering
    by Loizos Loizou on November 21, 2024 at 6:11 am

    Data is booming. It comes in vast volumes and variety, and this explosion comes with a plethora of job opportunities too. Is it worth switching to a data career now? My honest opinion: absolutely!

It is worth mentioning that this article comes from an Electrical and Electronic Engineering graduate who went all the way and spent almost 8 years in academia learning about the energy sector (and when I say all the way, I mean from a bachelor's degree to a PhD and postdoc). Even though that's also a high-demand career in the job market, I decided to switch to a data engineering path instead.

I constantly come across posts in forums and blogs where people from different disciplines ask how to switch to a career in data. So this article will take you through my journey, and show why an engineering graduate has nothing to worry about in the transition to this new field. I'll go through the market for data jobs, my story, and the skills that engineers have (whether electrical, mechanical, electronic, etc.) that equip them well for this fast-moving field.

Generative AI image that I got when I wrote my article title in ChatGPT. Impressive, isn't it?

Necessity for Data Professionals

As technology continues to advance exponentially (IoT devices, AI, web services, etc.), so does the amount of data generated every day. The result? The need for AI and data professionals is currently at an all-time high, and I think it is only going to get higher. It's currently at a level where the demand for these professionals severely outgrows supply, and new job listings are popping up every day.

According to the Dice Tech Job Report, positions like data engineer and data scientist are amongst the fastest-growing tech occupations. The reason is that companies have finally come to the realization that, with data, you can unlock unlimited business insights which can reveal their product's strengths and weaknesses. That is, if it is analyzed the correct way.

So what does this mean for future data professionals looking for a job? The following should be true, at least for the next few years:

Unlimited job listings: According to a recent report by LinkedIn, job postings for AI roles have surged by 119% over the past two years. Similarly, data engineering positions have seen a 98% increase. This highlights the urgency of companies to hire these kinds of professionals.

High salary potential: When demand exceeds supply, it immediately leads to higher salaries. These are fundamental laws of economics. Data professionals are now in an era where they have multiple options for a job, since companies acknowledge the value they bring to their company.

Multiple industry opportunities: Take my case, for example. I worked in data for the energy, retail, and finance sectors. I consider myself data agnostic, since I am now able to choose from opportunities across a relatively wide range of industries.

Future job growth: As mentioned before, the need for these professionals is only going to get higher, since data comes in all shapes and sizes and there is a need for people who know how to handle it.

Switching from another Engineering discipline to Data

So here comes the million-dollar question: How can an engineer, whether mechanical, electronic, electrical, civil, etc., switch to a career in data? Great question.

Is it easy? No. Is it worth it? Definitely. There's no single correct answer to this question. However, I can tell you my experiences and you can judge on your own.
I can also tell you the similarities I found between my engineering degree and what I'm doing now. So let's start.

A brief story of how I switched to data engineering:

Years 2020–2022

The year is 2020 and I'm about to finish my PhD. Confused about my options and what I could do after a long 4-year PhD (and with severe imposter syndrome too), I chose the safe path of academia and a postdoc position at a research and development center.

Imposter syndrome is, funnily enough, very common among PhD graduates. It is defined as "the persistent inability to believe that one's success is deserved or has been legitimately achieved as a result of one's own efforts or skills". Image source: DALL-E.

Whilst working there, I realized that I needed to get out of academia. I no longer had the strength to read more papers and proposals or, even more so, write journal and conference papers to showcase my work. I did all of that — I had enough. I had some 7–8 journal and conference papers published from my PhD, and I didn't really like the fact that this was the only way to showcase my work. So, I started looking for a job in industry.

In 2021, I managed to get a job in energy consulting. And guess what? More reports, more papers and, even better, PowerPoint slides! I felt like my engineering days were behind me and that I could do literally nothing useful. After a short stint in that position, I started looking for jobs again: something with technical challenges and meaning that got my brain working. This is when I started looking at data professions where I could use the skills that I had acquired throughout my career. This was also the time that I got the most rejections of my life!

Coming from very successful bachelor's and PhD degrees, I couldn't understand why my skills were not suited for a data position. I was applying to data engineer, analyst and scientist positions, but all I received was an automated reply like "Unfortunately we can't move forward with your application."

That's when I started applying literally everywhere. So if you are reading this because you can't make the switch, believe me, I get you.

Years 2022–2023

So, I started applying everywhere, to anything that even related to data. Even to positions where I didn't have any of the skills from the job description. That's where the magic happened.

I got an interview from a company in the retail sector for the position of "Commercial Intelligence Executive". Do you know what this position is about? No? That's right, I didn't either. All I saw in the job description was that it required 3–5 years of experience in data science. So I thought, this has something to do with data, so why not. I got the job and started working there. It turns out that "Commercial Intelligence" was a job title that was basically business intelligence for the commercial department. Lucky me, it was spot on. It gave me the opportunity to start experimenting with business intelligence.

Business Intelligence (BI) is defined as "the technical infrastructure that collects, stores and analyzes company data". Photo by Carlos Muza on Unsplash.

In that position, I used Power BI at first, since the role was about building reports and dashboards. Then I was hungry for more. I was fortunate that my manager was amazing, so they trusted me to do whatever I wanted with data. And so I did.

Before I knew it, my engineering skills were back.
All the problem-solving skills that I had gained throughout the years, the bug for solving challenges, and the exposure to different programming languages started connecting with each other. I started building automations in Power BI, then extended this to writing SQL to automate more things, and then to building data pipelines using Python. Within a year, I had pretty much all my processes automated, and I knew that I had the technical capability to take on more challenging and technically intensive problems. I built dashboards that brought useful insights to the business owners, and that felt incredible.

This was the lightbulb moment: this career, no matter what the data is about, was what I was looking for.

Years 2023–present

After one and a half years at the company, I knew it was time to go for something more technically challenging than just business intelligence. That's when an opportunity for a data engineer position turned up for me, and I took it.

Photo by Boitumelo on Unsplash.

For the past one and a half years I've been working in the finance sector as a data engineer. I have expanded my knowledge to more things such as AI, real-time streaming data pipelines, APIs, automations, and so much more. Job opportunities are coming up all the time, and I feel fortunate that I have made this switch; I couldn't recommend it enough. Was it challenging? I'll say that the only challenging part, in both the BI and data engineering positions, was the first 3 months, until I got to know the tools we use and the environments. My engineering expertise equipped me well to deal with different problems with excitement and do amazing things. I wouldn't change my degree for anything else. Not even for a Computer Science degree. How did my engineering degree help throughout this transition? This is discussed in the next section.

How Engineering equips you with skills that help in a data career

So if you've read this far, you must be wondering: how is my engineering degree preparing me for a career in data? This guy has told me nothing about this. You are right, let's get into it.

Engineering degrees are important not because of the discipline, but because of the way they structure the brains of those who study them. This is my personal opinion, but my engineering degrees exposed me to so many things and prepared me to solve problems in every respect, so that I feel much more confident now. But let's get to the specifics. These are some key engineering skills in which I see similarities and which I get to use in my data role every single day:

Programming: As an electrical and electronics engineer, I got exposure to multiple programming languages throughout my degrees. I used assembly language, Java, VHDL, C and Matlab. Likewise, I think other engineering disciplines do the same thing, since programming is a way to perform simulations in engineering. Even though I hadn't used Python or SQL during my degrees, it was a seamless transition to these two after being exposed to so many languages. I would even say an enjoyable one, since I used to hate coding during my bachelor's degree, but now I love it. It was probably a matter of tight deadlines and stress from so many things at the same time.

Problem Solving: I get to solve problems every day, but as my first university lecturer told us on the very first day at university, "Google is your friend."
If you have a knack for solving problems, and you have been exposed to the way engineering projects are handed out at universities (where they basically give you a one-paragraph description of the project and expect a product by the end of the week), believe me, you can solve data problems. You have been through enough preparation.

Math and Statistics: Engineering students get through intense mathematics such as linear algebra, calculus, statistics and more, which makes for a smooth transition to understanding machine learning. It's a bit difficult to grasp at first because it's new territory, but you'll get the hang of it.

Black Box Problems: I don't even know if this is a formal definition, but I consider "Black Box" problems to be the ones that are extremely difficult to solve: we've been using them, they work, but not a lot of people actually know what's happening in the background. In data, the "Black Box Problem" is AI. It's hot, it works and it's amazing, but no one really knows what's happening in the background. Similarly, engineering disciplines have their own "Black Box" problems. Sure, AI is difficult, but have you tried understanding the power network problem? That's no walk in the park.

Modelling and Simulations: Every engineering student has done modelling and simulations, and that's not so different from ML models and data models.

Data Processing and Analytics: As an engineering student, in my bachelor's and PhD degrees I did a lot of data processing, transformation and analytics on oscilloscope files, sensor files and smart devices with millions of rows of data. These are examples of data pipelines, as we call them in the data industry. I didn't really know at the time, though, that this was the name for it. When I got to do it in a corporate environment, these skills were transferable and helped so much.

Automations: Engineers hate repeated procedures. If there's a way to automate something, they will do it. This is the mindset that a data engineer needs. I carried this mindset to my data engineer position and it helps a lot, since I spend a lot of time automating stuff in my day-to-day work.

Presenting and explaining to non-technical people: One very common thing I did during my PhD was explaining my project to non-technical people so that they could understand what I was doing. This happens a lot in data. You prepare a lot of analysis for business people, so you have to be able to explain it too.

All of the above help me every single day in my data engineer position. Can you see the transferable skills now?

So, is it a happy ending?

Whilst I don't want to push every engineering discipline into a data position (I still think all engineers are useful where they are), I wanted to write this article to encourage the people who want to make the switch. There's so much rejection nowadays, but at the same time there are opportunities. All you need is the right opportunity, and then magic will follow, since you will be able to exploit your skills. The important thing is to keep trying.

If you liked this article, please give me a few claps and follow me on https://medium.com/@loizosloizou08
There's more content to follow :)

Rewiring My Career: How I Transitioned from Electrical Engineering to Data Engineering was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

 
