Towards Data Science – Medium Your home for data science. A Medium publication sharing concepts, ideas and codes. – Medium

  • My Five Key Learnings To Be a Better Data Scientist
    by Yu Dong on June 13, 2024 at 4:38 pm

    Reflection on my six-year data science career

  • Improving Business Performance with Machine Learning
    by Juan Jose Munoz on June 13, 2024 at 4:25 pm

    Whether you are a data scientist, analyst, or business analyst, your goal is to deliver projects that improve business performance.

    [Photo by Daria Nepriakhina 🇺🇦 on Unsplash]

    It might be tempting to focus on the latest machine learning developments or on tackling the big problems. However, you can often deliver great value by picking low-hanging fruit with simple machine learning algorithms.

    Benchmarking is one of those low-hanging fruits. It is the process of measuring business KPIs against similar organizations. It allows businesses to learn from the best and continuously improve performance.

    There are two types of benchmarking:

      1. Internal: measure KPIs against units/products in the same company
      2. External: measure KPIs against competitors

    In my daily work in the hotel industry, we often rely on third-party companies that collect hotel data for external benchmarking. However, the data we get from them is limited. On the other hand, we manage over 500 hotels and are sitting on vast amounts of data for potential benchmarking.

    This is the low-hanging fruit we recently set out to pick.

    No matter which type of benchmarking exercise you are conducting, the first step is to select a set of hotels similar to the subject hotel. In the hotel industry, we usually rely on location indicators, brand tier, number of rooms, price range, and market demand. We typically do this manually for one or two hotels, but doing it manually for 500 hotels is not feasible.

    Once you have a problem to solve, the next step is to select which tool to use. Machine learning offers many tools. However, this problem can be solved with a simple family of algorithms: Nearest Neighbors.

    The Nearest Neighbors algorithm family

    The nearest neighbors algorithm family is a form of optimization problem that aims to find the points in a given data set that are the closest or most similar to a given point.

    These algorithms have been very successful in tackling many classification and regression problems. As such, the Scikit Learn API has a fantastic Nearest Neighbors module.

    Choosing the right algorithm

    Most people are familiar with K-Nearest Neighbors (KNN); however, Scikit Learn offers a wide variety of Nearest Neighbors algorithms, covering both supervised and unsupervised tasks.

    For our problem, we don't have any labels. Therefore, we are looking for an unsupervised algorithm.

    If you look through the Scikit Learn documentation, you will find NearestNeighbors. This class performs unsupervised learning for implementing neighbor searches.

    This seems to cover what we need to solve our problem. Let's start by getting the data ready and running a baseline model.

    Baseline Model

    1. Loading the data

    A hotel's performance usually depends on location, brand, and size. For our analysis, we use two data sets.

    Hotel data: the hotel data used below has been generated artificially based on the original dataset used for this analysis.

      • BRAND: defines the service level of the hotel: Luxury, Upscale, Economy
      • Room_count: number of rooms available for sale
      • Market: name of the city in which the hotel is located
      • Country: name of the country
      • Latitude: hotel's latitude
      • Longitude: hotel's longitude
      • Airport Code: 3-letter code of the nearest international airport
      • Market Tier: defines the market development level
      • HCLASS: indicates if the hotel is a city hotel or resort
      • Demand: indicates hotel yearly occupancy
      • Price range: indicates the average price for the hotel

    We also know that hotel performance can be affected by accessibility.
    To measure accessibility, we can measure how far the hotel is from the main international airport. The airport data is from the World Bank:

      • 3-letter airport code
      • Name: airport name
      • TotalSeats: annual passenger volume
      • Country name: airport country name
      • Airport1Latitude: airport latitude
      • Airport1Longitude: airport longitude

    *The Global Airports dataset is licensed under Creative Commons Attribution 4.0.

    Let's import the data.

        import pandas as pd
        import numpy as np

        data = pd.read_excel("mock_data.xlsx")
        airport_data = pd.read_csv("airport_volume_airport_locations.csv")

    [Image: Sample of hotel data. Image by author]

    [Image: Sample airport data. Image by author]

    As mentioned before, hotel performance is highly dependent on location. In our data set, we have several measures of location, such as Market and Country. However, these are not always ideal because those definitions are quite broad. To narrow down similar locations, we need to create an accessibility measure, defined as the distance to the closest international airport.

    To calculate the distance from a hotel to the airport, we use the haversine formula, which gives the distance between two points on a sphere from their latitudes and longitudes.

        # Below code is taken from geeksforgeeks
        from math import radians, cos, sin, asin, sqrt

        def distance_to_airport(lat, airport_lat, lon, airport_lon):
            # Convert latitude and longitude values from decimal degrees to radians
            lon = radians(lon)
            airport_lon = radians(airport_lon)
            lat = radians(lat)
            airport_lat = radians(airport_lat)

            # Haversine formula
            dlon = airport_lon - lon
            dlat = airport_lat - lat
            a = sin(dlat / 2)**2 + cos(lat) * cos(airport_lat) * sin(dlon / 2)**2
            c = 2 * asin(sqrt(a))

            # Radius of earth in kilometers
            r = 6371

            # Return distance in km
            return c * r

        # Apply the distance_to_airport function to each hotel.
        # Note: the airport columns (Airport1Latitude, Airport1Longitude, ...) are assumed to be
        # already joined onto the hotel data; that step is not shown in this copy of the article.
        data["distance_to_airport"] = data.apply(
            lambda row: distance_to_airport(
                row["Latitude"], row["Airport1Latitude"],
                row["Longitude"], row["Airport1Longitude"],
            ),
            axis=1,
        )
        data.head()

    [Image: Resulting data frame with distance to airport feature. Image by author]

    The next step is removing any column we won't need for our model.

        # Drop columns that we don't need
        # For the purpose of benchmarking we will keep the hotel features and distance to airport
        col_to_drop = ["Latitude", "Longitude", "Airport Code", "Orig", "Name", "TotalSeats",
                       "Country Name", "Airport1Latitude", "Airport1Longitude"]
        data_clean = data.drop(col_to_drop, axis=1)
        data_clean.head()

    Next, we encode all non-numerical variables so that we can pass them into our model. At this point, it is important to keep in mind that we will need the original labels to present our suggested groupings to the team and for ease of validation. To do so, we will store the fitted encoders in a dictionary.

        from sklearn.preprocessing import LabelEncoder

        # Create a LabelEncoder object for each object column
        brand_encoder = LabelEncoder()
        market_encoder = LabelEncoder()
        country_encoder = LabelEncoder()
        market_tier_encoder = LabelEncoder()
        hclass_encoder = LabelEncoder()

        # Fit each LabelEncoder on the values of the corresponding column
        data_clean['BRAND'] = brand_encoder.fit_transform(data_clean['BRAND'])
        data_clean['Market'] = market_encoder.fit_transform(data_clean['Market'])
        data_clean['Country'] = country_encoder.fit_transform(data_clean['Country'])
        data_clean['Market Tier'] = market_tier_encoder.fit_transform(data_clean['Market Tier'])
        data_clean['HCLASS'] = hclass_encoder.fit_transform(data_clean['HCLASS'])

        # Create a dictionary with all the encoders for reverse encoding
        encoders = {"BRAND": brand_encoder,
                    "Market": market_encoder,
                    "Country": country_encoder,
                    "Market Tier": market_tier_encoder,
                    "HCLASS": hclass_encoder}
        data_clean.head()

    [Image: Encoded data. Image by author]

    Our data is now numerical, but as you can see, the values in each column have very different ranges. To prevent the range of any single feature from disproportionately affecting our model, we need to rescale our data.

        from sklearn.preprocessing import StandardScaler

        scaler = StandardScaler()
        data_scaled = scaler.fit_transform(data_clean)
        data_scaled

    [Image: Scaled data. Image by author]

    At this point, we are ready to generate a baseline model.

        from sklearn.neighbors import NearestNeighbors

        nns = NearestNeighbors()
        nns.fit(data_scaled)  # the estimator must be fitted before it can be queried
        nns_results_model_0 = nns.kneighbors(data_scaled)[1]
        nns_results_model_0

    [Image: Model output. Image by author]

    The output of the model is an array of indexes. In each row, the first index is the subject hotel, and the other indexes represent its nearest hotels.

    To validate the model, we can visually inspect the results. We can create a function that takes in the array of indexes and decodes the values.

        def clean_results(nns_results: np.ndarray, encoders: dict, data: pd.DataFrame):
            """
            Returns a dataframe with a list of labels for each Nearest Neighbor group
            """
            result = pd.DataFrame()

            # 1. Get a list of nearest hotels based on our model
            for i in range(len(nns_results)):
                results = {}  # empty dictionary to collect each row's values

                # Each row in nns_results contains the indexes of the selected nearest neighbors
                # We use those indexes to get the hotel names in our main data set
                results["Hotels"] = list(data.iloc[nns_results[i]].index)

                # 2. Get the values of each feature for all Nearest Neighbor groups
                for item in data.columns:
                    results[item] = list(data.iloc[nns_results[i]][item])

                # 3. Create a row for each Nearest Neighbor group and append to the main DataFrame
                df = pd.DataFrame([results])
                result = pd.concat([result, df], axis=0)

            # 4. Decode the labels of the encoded columns
            for key, val in encoders.items():
                result[key] = result[key].apply(lambda x: list(val.inverse_transform(x)))

            result.reset_index(drop=True, inplace=True)  # Reset the index for clarity
            return result

        results_model_0 = clean_results(nns_results=nns_results_model_0,
                                        encoders=encoders,
                                        data=data_clean)
        results_model_0.head()

    [Image: Initial benchmark groups. Image by author]

    Because we are using an unsupervised learning algorithm, there is no widely available measure of accuracy. However, we can use domain knowledge to validate our groups.

    Visually inspecting the groups, we can see that some benchmarking groups have a mix of Economy and Luxury hotels, which doesn't make business sense because the demand for those hotels is fundamentally different.

    We can scroll through the data and note some of those differences, but can we come up with our own accuracy measure?

    We want to create a function that measures the consistency of the recommended benchmarking sets across each feature. One way of doing this is to calculate the variance of each feature within each set. For each cluster, we can compute an average of the feature variances, and we can then average the cluster scores to get a total model score.

    From our domain knowledge, we know that in order to set up a comparable benchmark set, we need to prioritize hotels in the same Brand, possibly the same market, and the same country, and if we use different markets or countries, then the market tier should be the same.

    With that in mind, we want our measure to apply a higher penalty to variance in those features. To do so, we will use a weighted average to calculate each benchmark set's variance.
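    As a rough illustration of the idea, the sketch below scores two made-up benchmark sets by the entropy of their categorical labels and combines the per-feature values with weights. The hotel labels, the two weights, and the helper names label_entropy and weighted_score are assumptions for this example only, and the normalization step used later in the article is skipped.

```python
from collections import Counter
from scipy.stats import entropy


def label_entropy(labels):
    """Shannon entropy (natural log) of the label frequencies in one benchmark set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return float(entropy([c / total for c in counts.values()]))


# Hypothetical benchmark sets of four hotels each (labels invented for illustration)
consistent_set = {"BRAND": ["Luxury"] * 4, "Market": ["Paris"] * 4}
mixed_set = {"BRAND": ["Luxury", "Economy", "Luxury", "Economy"], "Market": ["Paris"] * 4}

# Assumed weights: brand consistency matters a bit more than market consistency
weights = {"BRAND": 0.3, "Market": 0.25}


def weighted_score(benchmark_set, weights):
    """Weighted sum of per-feature entropies; lower means a more homogeneous set."""
    return sum(weights[f] * label_entropy(vals) for f, vals in benchmark_set.items())


print(weighted_score(consistent_set, weights))  # every label identical, so no spread
print(weighted_score(mixed_set, weights))       # brand mix is penalized: 0.3 * ln(2)
```

    A set whose hotels all share the same brand and market scores zero, while the set mixing Luxury and Economy is penalized in proportion to the brand weight.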
    We will also print the variance of the key features and secondary features separately.

    To sum up, to create our accuracy measure, we need to:

      1. Calculate variance for categorical variables: one common approach is to use an "entropy-based" measure, where higher diversity in categories indicates higher entropy (variance).
      2. Calculate variance for numerical variables: we can compute the standard deviation or the range (difference between maximum and minimum values). This measures the spread of numerical data within each cluster.
      3. Normalize the data: normalize the variance scores for each category before applying weights to ensure that no single feature dominates the weighted average due to scale differences alone.
      4. Apply weights for different metrics: weight each type of variance based on its importance to the clustering logic.
      5. Calculate weighted averages: compute the weighted average of these variance scores for each cluster.
      6. Aggregate scores across clusters: the total score is the average of these weighted variance scores across all clusters or rows. A lower average score indicates that our model effectively groups similar hotels together, minimizing intra-cluster variance.

        from scipy.stats import entropy
        from sklearn.preprocessing import MinMaxScaler
        from collections import Counter

        def categorical_variance(data):
            """
            Calculate entropy for a categorical variable from a list.
            A higher entropy value indicates data with diverse classes.
            A lower entropy value indicates a more homogeneous subset of data.
            """
            # Count frequency of each unique value
            value_counts = Counter(data)
            total_count = sum(value_counts.values())
            probabilities = [count / total_count for count in value_counts.values()]
            return entropy(probabilities)

        # Set scoring weights, giving higher weights to the most important features
        scoring_weights = {"BRAND": 0.3,
                           "Room_count": 0.025,
                           "Market": 0.25,
                           "Country": 0.15,
                           "Market Tier": 0.15,
                           "HCLASS": 0.05,
                           "Demand": 0.025,
                           "Price range": 0.025,
                           "distance_to_airport": 0.025}

        def calculate_weighted_variance(df, weights):
            """
            Calculate the weighted variance score for clusters in the dataset
            """
            # Initialize a DataFrame to store the variances
            variance_df = pd.DataFrame()

            # 1. Calculate variances for numerical features
            numerical_features = ['Room_count', 'Demand', 'Price range', 'distance_to_airport']
            for feature in numerical_features:
                variance_df[feature] = df[feature].apply(np.var)

            # 2. Calculate entropy for categorical features
            categorical_features = ['BRAND', 'Market', 'Country', 'Market Tier', 'HCLASS']
            for feature in categorical_features:
                variance_df[feature] = df[feature].apply(categorical_variance)

            # 3. Normalize the variance and entropy values
            scaler = MinMaxScaler()
            normalized_variances = pd.DataFrame(scaler.fit_transform(variance_df),
                                                columns=variance_df.columns,
                                                index=variance_df.index)

            # 4. Compute weighted average
            cat_weights = {feature: weights[feature] for feature in categorical_features}
            num_weights = {feature: weights[feature] for feature in numerical_features}

            cat_weighted_scores = normalized_variances[categorical_features].mul(cat_weights)
            df['cat_weighted_variance_score'] = cat_weighted_scores.sum(axis=1)

            num_weighted_scores = normalized_variances[numerical_features].mul(num_weights)
            df['num_weighted_variance_score'] = num_weighted_scores.sum(axis=1)

            return df['cat_weighted_variance_score'].mean(), df['num_weighted_variance_score'].mean()

    To keep our code clean and track our experiments, let's also define a function to store the results of our experiments.

        # Define a function to store the results of our experiments
        def model_score(data: pd.DataFrame,
                        weights: dict = scoring_weights,
                        model_name: str = "model_0"):
            cat_score, num_score = calculate_weighted_variance(data, weights)
            results = {"Model": model_name,
                       "Primary features score": cat_score,
                       "Secondary features score": num_score}
            return results

        model_0_score = model_score(results_model_0, scoring_weights)
        model_0_score

    [Image: Baseline model results.]

    Now that we have a baseline, let's see if we can improve our model.

    Improving our Model Through Experimentation

    Up until now, we did not have to know what was going on under the hood when we ran this code:

        nns = NearestNeighbors()
        nns.fit(data_scaled)
        nns_results_model_0 = nns.kneighbors(data_scaled)[1]

    To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

    Let's start by looking at the Scikit Learn documentation and source code:

        # The below is taken directly from the scikit-learn source
        from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase, RadiusNeighborsMixin

        class NearestNeighbors_(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
            """Unsupervised learner for implementing neighbor searches.

            Parameters
            ----------
            n_neighbors : int, default=5
                Number of neighbors to use by default for :meth:`kneighbors` queries.

            radius : float, default=1.0
                Range of parameter space to use by default for :meth:`radius_neighbors`
                queries.

            algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
                Algorithm used to compute the nearest neighbors:

                - 'ball_tree' will use :class:`BallTree`
                - 'kd_tree' will use :class:`KDTree`
                - 'brute' will use a brute-force search.
                - 'auto' will attempt to decide the most appropriate algorithm
                  based on the values passed to :meth:`fit` method.

                Note: fitting on sparse input will override the setting of
                this parameter, using brute force.

            leaf_size : int, default=30
                Leaf size passed to BallTree or KDTree. This can affect the
                speed of the construction and query, as well as the memory
                required to store the tree. The optimal value depends on the
                nature of the problem.

            metric : str or callable, default='minkowski'
                Metric to use for distance computation. Default is "minkowski", which
                results in the standard Euclidean distance when p = 2. See the
                documentation of `scipy.spatial.distance <>`_ and the metrics listed in
                :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
                values.

            p : float (positive), default=2
                Parameter for the Minkowski metric from
                sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
                equivalent to using manhattan_distance (l1), and euclidean_distance
                (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

            metric_params : dict, default=None
                Additional keyword arguments for the metric function.
            """

            def __init__(
                self,
                *,
                n_neighbors=5,
                radius=1.0,
                algorithm="auto",
                leaf_size=30,
                metric="minkowski",
                p=2,
                metric_params=None,
                n_jobs=None,
            ):
                super().__init__(
                    n_neighbors=n_neighbors,
                    radius=radius,
                    algorithm=algorithm,
                    leaf_size=leaf_size,
                    metric=metric,
                    p=p,
                    metric_params=metric_params,
                    n_jobs=n_jobs,
                )

    There are quite a few things going on here.

    The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest neighbor estimators. This class handles the common functionalities required for nearest-neighbor searches, such as:

      • n_neighbors (the number of neighbors to use)
      • radius (the radius for radius-based neighbor searches)
      • algorithm (the algorithm used to compute the nearest neighbors, such as 'ball_tree', 'kd_tree', or 'brute')
      • metric (the distance metric to use)
      • metric_params (additional keyword arguments for the metric function)

    The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixin classes add specific neighbor-search functionalities to NearestNeighbors.

    KNeighborsMixin provides functionality to find the nearest fixed number k of neighbors to a point. It does that by finding the distance to the neighbors and their indices, and it can construct a graph of connections between points based on the k-nearest neighbors of each point.

    RadiusNeighborsMixin is based on the radius neighbors algorithm, which finds all neighbors within a given radius of a point.
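    To make the difference between the two search styles concrete, here is a minimal sketch on a made-up one-dimensional array (the five points and the radius of 1.5 are assumptions chosen for illustration). It uses the public kneighbors and radius_neighbors methods of NearestNeighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: five points on a line, in two loose groups
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])

nn = NearestNeighbors(n_neighbors=3, radius=1.5).fit(X)

# k-nearest search: every query point gets exactly 3 neighbors,
# even if the third one is far away
knn_dist, knn_idx = nn.kneighbors(X)

# Radius search: every query point gets all points within 1.5 units,
# so the number of neighbors varies from point to point
rad_dist, rad_idx = nn.radius_neighbors(X)

print(knn_idx[3].tolist())          # neighbors of the point at 10.0 (includes itself)
print(sorted(rad_idx[3].tolist()))  # only the points within the radius
print(sorted(rad_idx[1].tolist()))  # the point at 1.0 has more neighbors in range
```

    For the point at 10.0, the fixed-k search is forced to include a distant point from the other group, while the radius search returns only the nearby one.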
    The radius-based method is useful in scenarios where the focus is on capturing all points within a meaningful distance threshold rather than a fixed number of points.

    Based on our scenario, KNeighborsMixin provides the functionality we need.

    We need to understand one key parameter before we can improve our model: the distance metric.

    A quick introduction to distance

    The documentation mentions that the NearestNeighbors algorithm uses the "Minkowski" distance by default and gives us a reference to the SciPy API.

    In scipy.spatial.distance, we can see two mathematical representations of the "Minkowski" distance:

        $\|u - v\|_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$

    This formula calculates the p-th root of the sum of powered differences across all elements.

    The second mathematical representation of the "Minkowski" distance is:

        $\|u - v\|_p = \left( \sum_i w_i \, |u_i - v_i|^p \right)^{1/p}$

    This is very similar to the first one, but it introduces weights $w_i$ on the differences, emphasizing or de-emphasizing specific dimensions. This is useful where certain features are more relevant than others. By default, the setting is None, which gives all features the same weight of 1.0.

    This is a great option for improving our model, as it allows us to pass domain knowledge to our model and emphasize the similarities that are most relevant to users.

    If we look at the formulas, we also see the parameter p. This parameter affects the "path" the algorithm takes to calculate the distance. By default, p=2, which represents the Euclidean distance.

    You can think of the Euclidean distance as the length of a straight line drawn between two points. This is usually the shortest distance; however, it is not always the most desirable way of calculating distance, especially in higher-dimensional spaces. For more information on why this is the case, see the paper by Aggarwal, Hinneburg, and Keim listed in the references.

    Another common value for p is 1. This represents the Manhattan distance. You can think of it as the distance between two points measured along a grid-like path.

    On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.

    By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

    Experiment 1: Baseline model with n_neighbors = 4

    By default, n_neighbors is 5. However, for our benchmark set, we want to compare each hotel to the 3 most similar hotels. To do so, we need to set n_neighbors = 4 (subject hotel + 3 peers).

        nns_1 = NearestNeighbors(n_neighbors=4)
        nns_1.fit(data_scaled)
        nns_1_results_model_1 = nns_1.kneighbors(data_scaled)[1]

        results_model_1 = clean_results(nns_results=nns_1_results_model_1,
                                        encoders=encoders,
                                        data=data_clean)
        model_1_score = model_score(results_model_1, scoring_weights, model_name="baseline_k_4")
        model_1_score

    [Image: Slight improvement in our primary features. Image by author]

    Experiment 2: adding weights

    Based on the documentation, we can pass weights to the distance calculation to emphasize the relationship across some features.
    Based on our domain knowledge, we have identified the features we want to emphasize: in this case, Brand, Market, Country, and Market Tier.

        # Set up weights for the distance calculation
        weights_dict = {"BRAND": 5,
                        "Room_count": 2,
                        "Market": 4,
                        "Country": 3,
                        "Market Tier": 3,
                        "HCLASS": 1.5,
                        "Demand": 1,
                        "Price range": 1,
                        "distance_to_airport": 1}

        # Transform the weights dictionary into a list, keeping the scaled data column order
        weights = [weights_dict[idx] for idx in list(scaler.get_feature_names_out())]

        nns_2 = NearestNeighbors(n_neighbors=4, metric_params={'w': weights})
        nns_2.fit(data_scaled)
        nns_2_results_model_2 = nns_2.kneighbors(data_scaled)[1]

        results_model_2 = clean_results(nns_results=nns_2_results_model_2,
                                        encoders=encoders,
                                        data=data_clean)
        model_2_score = model_score(results_model_2, scoring_weights, model_name="baseline_with_weights")
        model_2_score

    [Image: Primary features score keeps improving. Image by author]

    Passing domain knowledge to the model via weights improved the primary features score significantly (a lower score is better). Next, let's test the impact of the distance measure.

    Experiment 3: use Manhattan distance

    So far, we have been using the Euclidean distance. Let's see what happens if we use the Manhattan distance instead.

        nns_3 = NearestNeighbors(n_neighbors=4, p=1, metric_params={'w': weights})
        nns_3.fit(data_scaled)
        nns_3_results_model_3 = nns_3.kneighbors(data_scaled)[1]

        results_model_3 = clean_results(nns_results=nns_3_results_model_3,
                                        encoders=encoders,
                                        data=data_clean)
        model_3_score = model_score(results_model_3, scoring_weights, model_name="Manhattan_with_weights")
        model_3_score

    [Image: Significant decrease in primary score. Image by author]

    Experiment 4: use Chebyshev distance

    Decreasing p to 1 resulted in some good improvements. Let's see what happens as p approaches infinity.

    To use the Chebyshev distance, we will change the metric parameter to Chebyshev. The default sklearn Chebyshev metric doesn't have a weight parameter. To get around this, we will define a custom weighted_chebyshev metric.

        # Define the custom weighted Chebyshev distance function
        def weighted_chebyshev(u, v, w):
            """Calculate the weighted Chebyshev distance between two points."""
            return np.max(w * np.abs(u - v))

        nns_4 = NearestNeighbors(n_neighbors=4, metric=weighted_chebyshev, metric_params={'w': weights})
        nns_4.fit(data_scaled)
        nns_4_results_model_4 = nns_4.kneighbors(data_scaled)[1]

        results_model_4 = clean_results(nns_results=nns_4_results_model_4,
                                        encoders=encoders,
                                        data=data_clean)
        model_4_score = model_score(results_model_4, scoring_weights, model_name="Chebyshev_with_weights")
        model_4_score

    [Image: Better than the baseline but higher than the previous experiment. Image by author]

    We managed to decrease the primary feature variance scores through experimentation.

    Let's visualize the results.

        results_df = pd.DataFrame([model_0_score, model_1_score, model_2_score,
                                   model_3_score, model_4_score]).set_index("Model")
        results_df.plot(kind='barh')

    [Image: Experimentation results. Image by author]

    Using the Manhattan distance with weights seems to give the most accurate benchmark sets according to our needs.

    The last step before implementing the benchmark sets is to examine the sets with the highest primary features scores and identify what steps to take with them.

        # Histogram of primary features score
        results_model_3["cat_weighted_variance_score"].plot(kind="hist")

    [Image: Score distribution. Image by author]

        exceptions = results_model_3[results_model_3["cat_weighted_variance_score"] >= 0.4]
        print(f"There are {exceptions.shape[0]} benchmark sets with significant variance across the primary features")

    [Image by author]

    These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

    As you can see, with a few lines of code and some understanding of nearest neighbor search, we managed to set internal benchmark sets. We can now distribute the sets and start measuring hotels' KPIs against their benchmark sets.

    You don't always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning can deliver great value.

    What are some low-hanging fruits in your business that you could easily tackle with machine learning?

    REFERENCES

      • World Bank. "World Development Indicators." Retrieved June 11, 2024.
      • Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle.
      • SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024.
      • GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024.
      • scikit-learn. Nearest Neighbors Module. Retrieved June 11, 2024.

    Improving Business Performance with Machine Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • The Skill That Holds Back (Most) Data Scientists
    by Shaw Talebi on June 13, 2024 at 4:18 pm

    7 communication tips that made me a better data scientistWhen starting in data science, I was hyper-focused on learning Python, Machine Learning, Statistics, etc. While these are necessary, there is another skill that most tend to overlook at their peril—communication. In this article, I discuss why data scientists must be effective communicators and share 7 tips that have helped me improve this skill set.Photo by Mitchel Lensink on UnsplashBack in grad school, the Physics department hosted weekly colloquia, where guest speakers would come to present their research. The typical story was that most of the audience understood the first slide (title) and maybe the second (agenda) but got lost after that.The same happens in data science when non-technical stakeholders sit through presentations from (most) data scientists. “It made sense until you started talking about train-test splits and AUCs,” they might say.While this might seem like an unavoidable reality of data science, I’ve learned that explaining these topics more effectively is not only possible but essential for advancing a data science career.Here, I share the key communication tips I’ve used to get promoted, land clients, and explain AI to 100k+ people.Why Communication MattersThe importance of communication may shock some and be met with some resistance. So, allow me to explain this a little more.Data scientists don’t typically solve their problems; rather, they solve other people’s problems (i.e., stakeholders). This is how data scientists generate value in a business context.Therefore, the amount of value a data scientist provides is directly proportional to how effectively they can collaborate with their non-technical stakeholders. To put it plainly, if stakeholders don’t understand and adopt your solution, it provides zero value.You can learn thisSome might think that communication is one of those skills you either have or don’t have. This, of course, is false. 
Communication (like any other skill) can and must be developed through practice.For instance, I started this journey as an overly technical physics grad student, but after 5 years of actively giving presentations, writing articles, making YouTube videos, hosting events, interviewing entrepreneurs, and doing technical consultations, I now get praised (and paid) for my communication skills. If I can do it, you can too.7 Communication Tips for Data ScientistsThe following are the communication tips I use most often. Although I’m focusing on technical presentations here, these tips broadly apply to conversations, writing, and beyond.An upside of developing this skill as a data scientist is that the bar is so low that even becoming a decent communicator can put you ahead of most of your peers (I am living proof of that).Tip 1: Use StoriesThe most powerful way to communicate is through storytelling. Our brains are wired for stories [1]. So, the more you can use them, the better.When I say “story,” you might think of the textbook definition, i.e., an account of imaginary or real people and events [2]. However, I mean it in a broader sense, which I picked up from the book The Storyteller’s Secret [1].There, the author defines a story as any 3-part narrative. Some examples of this are:Status quo. Problem. Solution.What? Why? How?What? So what? What now?Here’s what the first example above looks like in action.AI has taken the business world by storm (status quo). While its potential is clear, translating it into specific value-generating business applications remains a challenge (problem). Here I discuss 5 AI success stories to help guide its development in your business (solution).Tip 2: Use ExamplesData science is full of abstract ideas that bear little resemblance to our daily lives (e.g., features, overfitting, curse of dimensionality). A powerful way to make these abstract ideas relatable is through specific examples.Let’s demonstrate the power of examples by example. 
Suppose a stakeholder asks you, “What’s a feature?”Your instinct might be to give a definition, i.e., “Features are what we use to make predictions.” However, this is a pretty vague statement.A simple way to clarify this is by following up the general definition with a specific example like, “For example, the features in our customer churn are Account Age and Number of Logins in Past 90 Days.”Tip 3: Use AnalogiesAlthough examples are powerful, sometimes they don’t get the job done. This is where analogies come in. Analogies are powerful because they map the familiar to the unfamiliar.For instance, the other day, I found myself explaining Mechanistic Interpretability to a non-technical client. This is a big, scary term (even for data scientists), so here’s how I explained it.Modern AIs like ChatGPT are powerful, but we don’t really know how they work under the hood. The idea with Mechanistic Interpretability is to look under the hood to find out what different parts of the model do.By comparing an LLM (unfamiliar) to a car engine (familiar), this abstract concept becomes much more digestible.Tip 4: Numbered ListsIn a sea of ideas and words, numbers tend to stand out. This makes them an effective way to convey information.For example, I’m using this technique to structure the 7 communication tips in this article. However, this goes beyond the typical internet listicle you might see.Another way to use numbered lists is when making multiple points in the flow of communication. For example, I want to make 2 points here: 1) numbers stand out to us, and 2) they provide a clear way to structure information.The reason (IMO) this works so well is becuase numbers like 1, 2, 3, etc., are such basic and familiar concepts that they require little cognitive effort to process.Tip 5: Less is More“I didn’t have time to write a short letter, so I wrote a long one instead.” — Mark TwainThis is the most fundamental principle of effective communication. 
    Your audience only has a finite amount of attention to give you. Therefore, as communicators, we need to be economical when spending our audience’s attention.

    While you might think fewer words mean less time, the opposite is often true. Distilling ideas down to the essentials takes many iterations.

    This can mean cutting down the number of slides in a presentation, the number of elements on each slide, and even the number of characters used in the title.

    Here are some heuristics I use in a business context:

      • Keep talks to 20 min or less (~10 slides or fewer)
      • Don’t have more than 3–5 elements per slide
      • Make bullet points as short as possible (down to the character)

    Tip 6: Show Don’t Tell

    A corollary of less is more is pictures over words. Our brains need more effort to process text than images, so conveying ideas through pictures is an unreasonably effective way to preserve people’s limited attention while still making the point.

    Here is the fine-tuning analogy from Tip 3, compared to a visual representation of the same idea.

    Fine-tuning analogy in words vs images. Image by author.

    This highlights the power of data visualizations. Although this topic deserves a dedicated article, it shares the foundational principle of less is more.

    Tip 7: Slow Down

    This final tip was a game-changer for me. Before, I tended to rush through presentations. This was likely a result of nerves and just trying to get it over with. Eventually, however, I realized the nerves would naturally subside if I slowed down my pace and used a calmer tone.

    Slowing down has the added benefit of improving the audience’s experience. A rushed talk can feel like getting blasted by a firehose, while a well-paced one is like a soothing stream. Consequently, a short, rushed talk is more painful than a long, well-paced one.

    A rushed talk vs a well-paced one.
    Image by author.

    Bonus: Know Thy Audience

    While the tips above can yield quick improvements to one’s communication, their impact will be limited if the communication is not tailored to the audience. This highlights the importance of empathy.

    Empathy means seeing things from someone else’s perspective. It is essential for effective communication because it provides the context for framing all aspects of your presentation.

    The more you can put yourself in the audience’s shoes, the more effectively you can speak to what they care about and understand.

    Conclusion

    Most data scientists’ limiting factor is not their technical skills but their ability to communicate effectively. Developing this skill is one of the best ways for data science professionals to advance their careers and make a greater impact.

    Here, I shared 7 tips that have been most helpful to me in improving my communication. If you have tips that have helped you, drop them in the comments :)

    Resources

    [1] Carmine Gallo. The Storyteller’s Secret: From TED Speakers to Business Legends, Why Some Ideas Catch On and Others Don’t.

    [2] Oxford Languages. (2024). Story. Retrieved June 11, 2024.

    Skill That Holds Back (Most) Data Scientists was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Take a Look Under the hood
    by Dorian Drost on June 13, 2024 at 4:14 pm

    Take a Look Under the Hood

    Using Monosemanticity to understand the concepts a Large Language Model learned

    Just like with the brain, it is quite hard to understand what is really happening inside an LLM. Photo by Robina Weermeijer on Unsplash

    With the increasing use of Large Language Models (LLMs), the need to understand their reasoning and behavior increases as well. In this article, I want to present an approach that sheds some light on the concepts an LLM represents internally. In this approach, a representation is extracted that allows one to understand a model’s activation in terms of discrete concepts being used for a given input. This is called Monosemanticity, indicating that these concepts have just a single (mono) meaning (semantic).

    In this article, I will first describe the main idea behind Monosemanticity. For that, I will explain sparse autoencoders, which are a core mechanism within the approach, and show how they are used to structure an LLM’s activation in an interpretable way. Then I will retrace some of the demonstrations the authors of the Monosemanticity approach used to illustrate their insights, closely following their original publication.

    Sparse autoencoders

    Just like an hourglass, an autoencoder has a bottleneck the data must pass through. Photo by Alexandar Todov on Unsplash

    We have to start by taking a look at sparse autoencoders. First of all, an autoencoder is a neural net that is trained to reproduce a given input, i.e. it is supposed to produce exactly the vector it was given. Now you may wonder, what’s the point? The important detail is that the autoencoder has intermediate layers that are smaller than the input and output. Passing information through these layers necessarily leads to a loss of information, and hence the model is not able to just learn the input by heart and reproduce it fully.
    It has to pass the information through a bottleneck and hence needs to come up with a compact representation of the input that still allows it to reproduce it as well as possible. The first half of the model we call the encoder (from input to bottleneck) and the second half we call the decoder (from bottleneck to output). After having trained the model, you may throw away the decoder. The encoder now transforms a given input into a representation that keeps important information but has a different structure than the input and potentially removes unneeded parts of the data.

    To make an autoencoder sparse, its objective is extended. Besides reconstructing the input as well as possible, the model is also encouraged to activate as few neurons as possible. Instead of using all the neurons a little, it should focus on using just a few of them but with a high activation. This also makes it possible to have more neurons in total, so the bottleneck disappears from the model’s architecture. However, the fact that activating too many neurons is punished still keeps the idea of compressing the data as much as possible. The neurons that are activated are then expected to represent important concepts that describe the data in a meaningful way. We call them features from now on.

    In the original Monosemanticity publication, such a sparse autoencoder is trained on an intermediate layer in the middle of the Claude 3 Sonnet model (an LLM published by Anthropic that can be said to play in the same league as the GPT models from OpenAI). That is, you can take some tokens (i.e. text snippets), pass them through the first half of the Claude 3 Sonnet model, and forward the resulting activation to the sparse autoencoder. You will then get an activation of the features that represent the input. However, we don’t really know what these features mean so far.
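The sparse objective described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy sizes, the random weights, the ReLU activation, and the `l1_coeff` value are all hypothetical, and the plain L1 penalty is a common textbook choice rather than the exact penalty used in the publication. The sketch shows the two parts of the loss: reconstruction error plus a penalty on how strongly features activate.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative feature activations, then decode them."""
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activations]  # ReLU (assumed)
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 penalty on the features."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    sparsity_penalty = sum(abs(f) for f in features)
    return reconstruction_error + l1_coeff * sparsity_penalty


# Toy sizes (hypothetical): 4 input dimensions and 16 features, i.e. more
# features than inputs, so sparsity takes over the role of the bottleneck.
rng = random.Random(0)
d_in, d_features = 4, 16
w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_features)]
b_enc = [0.0] * d_features
w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(d_features)] for _ in range(d_in)]
b_dec = [0.0] * d_in

x = [1.0, -2.0, 0.5, 3.0]  # stand-in for one LLM activation vector
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
print(f"active features: {sum(f > 0 for f in features)} of {d_features}")
print(f"loss: {loss:.3f}")
```

With untrained random weights the loss is of course large; training would adjust the weights to lower both terms at once, which is what pushes the model towards a few strongly active features per input.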
    To find out, let’s imagine we feed the following texts to the model:

      • The cat is chasing the dog.
      • My cat is lying on the couch all day long.
      • I don’t have a cat.

    If there is one feature that activates for all three of the sentences, you may guess that this feature represents the idea of a cat. There may be other features, though, that activate only for single sentences but not for the others. For sentence one, you would expect the feature for dog to be activated, and to represent the meaning of sentence three, you would expect a feature that represents some form of negation or “not having something”.

    Different features

    Features can describe quite different things, from apples and bananas to the notion of being edible and tasting sweet. Photo by Jonas Kakaroto on Unsplash

    From the aforementioned example, we saw that features can describe quite different things. There may be features that represent concrete objects or entities (such as cats, the Eiffel Tower, or Benedict Cumberbatch), but there may also be features dedicated to more abstract concepts like sadness, gender, revolution, lying, things that can melt, or the German letter ß (yes, we indeed have an additional letter just for ourselves). As the model also saw programming code during its training, it also includes many features that are related to programming languages, representing contexts such as code errors or computational functions. You can explore the features of the Claude 3 model here.

    If the model is capable of speaking multiple languages, the features are found to be multilingual. That means a feature that corresponds to, say, the concept of sorrow would be activated in relevant sentences in each language. Likewise, the features are also multimodal if the model is able to work with different input modalities.
    The Benedict Cumberbatch feature would then activate for the name, but also for pictures or verbal mentions of Benedict Cumberbatch.

    Influence on behavior

    Features can influence behavior, just like a steering wheel influences the direction you take. Photo by Niklas Garnholz on Unsplash

    So far we have seen that certain features are activated when the model produces a certain output. From a model’s perspective, the direction of causality is the other way round though. If the feature for the Golden Gate Bridge is activated, this causes the model to produce an answer that is related to this feature’s concept. In the following, this is demonstrated by artificially increasing the activation of a feature within the model’s inference.

    Answers of the model being influenced by a high activation of a certain feature. Image taken from the original publication.

    On the left, we see the answers to two questions in the normal setup, and on the right, we see how these answers change if the activation of the features Golden Gate Bridge (first row) and brain sciences (second row) is increased. It is quite intuitive that activating these features makes the model produce texts that include the concepts of the Golden Gate Bridge and brain sciences. In the usual case, the features are activated from the model’s input and its prompt, but with the approach we saw here, one can also activate some features in a more deliberate and explicit way. You could think of always activating the politeness feature to steer the model’s answers in the desired way. Without the notion of features, you would do that by adding instructions to the prompt such as “always be polite in your answers”, but with the feature concept, this could be done more explicitly. On the other hand, you can also think of deactivating features explicitly to avoid the model telling you how to build an atomic bomb or conduct tax fraud.

    Taking a deeper look: Specificity, Sensitivity and Completeness

    Let’s observe the features in more detail.
    Photo by K8 on Unsplash

    Now that we have understood how the features are extracted, we can follow some of the authors’ experiments that show us which features and concepts the model actually learned.

    First, we want to know how specific the features are, i.e. how well they stick to their exact concept. We may ask, does the feature that represents Benedict Cumberbatch indeed activate only for Benedict Cumberbatch and not for other actors? To shed some light on this question, the authors used an LLM to rate texts regarding their relevance to a given concept. In the following example, it was assessed how much a text relates to the concept of brain science on a scale from 0 (completely irrelevant) to 3 (very relevant). In the next figure, we see these ratings as the colors (blue for 0, red for 3) and we see the activation level on the x-axis. The more we go to the right, the more the feature is activated.

    The activation of the feature for brain science together with relevance scores of the inputs. Image taken from the original publication.

    We see a clear correlation between the activation (x-axis) and the relevance (color). The higher the activation, the more often the text is considered highly relevant to the topic of brain sciences. The other way round, for texts that are of little or no relevance to the topic of brain sciences, the feature only activates marginally (if at all). That means that the feature is quite specific for the topic of brain science and does not activate that much for related topics such as psychology or medicine.

    Sensitivity

    The other side of the coin to specificity is sensitivity. We just saw an example of how a feature activates only for its topic and not for related topics (at least not so much), which is the specificity. Sensitivity now asks the question “but does it activate for every mention of the topic?” In general, you can easily have the one without the other.
    A feature may only activate for the topic of brain science (high specificity), but it may miss the topic in many sentences (low sensitivity).

    The authors spend less effort on the investigation of sensitivity. However, there is a demonstration that is quite easy to understand: The feature for the Golden Gate Bridge activates for sentences on that topic in many different languages, even without the explicit mention of the English term “Golden Gate Bridge”. More fine-grained analyses are quite difficult here because it is not always clear what a feature is supposed to represent in detail. Say you have a feature that you think represents Benedict Cumberbatch. Now you find out that it is very specific (reacting to Benedict Cumberbatch only), but only reacts to some (not all) pictures. How can you know if the feature is just insensitive, or if it is rather a feature for a more fine-grained subconcept such as Sherlock from the BBC series (played by Benedict Cumberbatch)?

    Completeness

    In addition to the features’ activation for their concepts (specificity and sensitivity), you may wonder if the model has features for all important concepts. It is quite difficult to decide which concepts it should have though. Do you really need a feature for Benedict Cumberbatch? Are “sadness” and “feeling sad” two different features? Is “misbehaving” a feature on its own, or can it be represented by the combination of the features for “behaving” and “negation”?

    To get a glimpse of the feature completeness, the authors selected some categories of concepts that have a limited number of members, such as the elements in the periodic table. In the following figure, we see all the elements on the x-axis and we see whether a corresponding feature has been found for three different sizes of the autoencoder model (from 1 million to 34 million features).

    Elements of the periodic table having a feature in the autoencoders of different sizes.
    Image taken from original publication.

    It is not surprising that the biggest autoencoder has features for more elements of the periodic table than the smaller ones. However, it also doesn’t catch all of them. We don’t know, though, if this really means that the model does not have a clear concept of, say, Bohrium, or if it just did not survive within the autoencoder.

    Limitations

    While we saw some demonstrations of the features representing the concepts the model learned, we have to emphasize that these were in fact qualitative demonstrations and not quantitative evaluations. All the examples were great to get an idea of what the model actually learned and to demonstrate the usefulness of the Monosemanticity approach. However, a formal evaluation that assesses all the features in a systematic way is needed to really back up the insights gained from such investigations. That is easy to say and hard to conduct, as it is not clear what such an evaluation could look like. Future research is needed to find ways to underpin such demonstrations with quantitative and systematic data.

    Summary

    Monosemanticity is an interesting path, but we don’t yet know where it will lead us. Photo by Ksenia Kudelkina on Unsplash

    We just saw an approach that allows us to gain some insights into the concepts a Large Language Model may leverage to arrive at its answers. A number of demonstrations showed how the features extracted with a sparse autoencoder can be interpreted in a quite intuitive way. This promises a new way to understand Large Language Models. If you know that the model has a feature for the concept of lying, you could expect it to do so, and having a concept of politeness (vs. not having it) can influence its answers quite a lot. For a given input, the features can also be used to understand the model’s thought traces.
    When asking a model to tell a story, the activation of the feature for a happy ending may explain how it comes to a certain ending, and when the model does your tax declaration, you may want to know if the concept of fraud is activated or not.

    As we see, there is quite some potential to understand LLMs in more detail. A more formal and systematic evaluation of the features is needed, though, to back the promises this form of analysis makes.

    Sources

    This article is based on this publication, where the Monosemanticity approach is applied to an LLM:

      • Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

    There is also a previous work that introduces the core ideas in a more basic model:

      • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

    Take a Look Under the hood was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
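To make the specificity and sensitivity discussion in the article above more concrete, here is a minimal sketch in Python. It rests on several assumptions that are not in the original publication: the function name, the activation threshold of 0.5, the binary on-topic labels (the authors used LLM ratings from 0 to 3 instead), and the made-up records are all hypothetical. It also uses the informal reading from the article (specificity: when the feature fires, is the text on topic? sensitivity: when the text is on topic, does the feature fire?), which may differ from the textbook statistical definitions.

```python
def specificity_and_sensitivity(records, threshold=0.5):
    """Return (specificity, sensitivity) for one feature.

    records: list of (activation, is_on_topic) pairs, one per text.
    A feature counts as "fired" when its activation exceeds the threshold.
    """
    fired = [(activation > threshold, on_topic) for activation, on_topic in records]
    fired_on_topic = sum(1 for f, t in fired if f and t)
    fired_total = sum(1 for f, _ in fired if f)
    on_topic_total = sum(1 for _, t in fired if t)
    # Share of firing texts that are on topic (None if the feature never fired).
    specificity = fired_on_topic / fired_total if fired_total else None
    # Share of on-topic texts on which the feature fired (None if no such text).
    sensitivity = fired_on_topic / on_topic_total if on_topic_total else None
    return specificity, sensitivity


# Made-up example: activation of a hypothetical "brain science" feature
# together with a hand-assigned on-topic label for each text.
records = [
    (0.9, True), (0.8, True), (0.1, True), (0.2, True),
    (0.7, False), (0.0, False), (0.05, False),
]
specificity, sensitivity = specificity_and_sensitivity(records)
print(f"specificity: {specificity:.2f}, sensitivity: {sensitivity:.2f}")
```

In this toy data the feature fires on three texts, two of which are on topic, while it misses two of the four on-topic texts. That is the situation the article describes: a feature can be fairly specific and still not very sensitive.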

  • Sparse Autoencoders, Additive Decision Trees, and Other Emerging Topics in AI Interpretability
    by TDS Editors on June 13, 2024 at 1:32 pm

    Feeling inspired to write your first TDS post? We’re always open to contributions from new authors.

    As LLMs get bigger and AI applications more powerful, the quest to better understand their inner workings becomes harder, and more acute. Conversations around the risks of black-box models aren’t exactly new, but as the footprint of AI-powered tools continues to grow, and as hallucinations and other suboptimal outputs make their way into browsers and UIs with alarming frequency, it’s more important than ever for practitioners (and end users) to resist the temptation to accept AI-generated content at face value.

    Our lineup of weekly highlights digs deep into the problem of model interpretability and explainability in the age of widespread LLM use. From detailed analyses of an influential new paper to hands-on experiments with other recent techniques, we hope you take some time to explore this ever-crucial topic.

      • Deep Dive into Anthropic’s Sparse Autoencoders by Hand
        Within a few short weeks, Anthropic’s “Scaling Monosemanticity” paper has attracted a lot of attention within the XAI community. Srijanie Dey, PhD presents a beginner-friendly primer for anyone interested in the researchers’ claims and goals, and in how they came up with an “innovative approach to understanding how different components in a neural network interact with one another and what role each component plays.”

      • Interpretable Features in Large Language Models
        For a high-level, well-illustrated explainer on the “Scaling Monosemanticity” paper’s theoretical underpinnings, we highly recommend Jeremi Nuer’s debut TDS article. You’ll leave it with a firm grasp of the researchers’ thinking and of this work’s stakes for future model development: “as improvements plateau and it becomes more difficult to scale LLMs, it will be important to truly understand how they work if we want to make the next leap in performance.”

      • The Meaning of Explainability for AI
        Taking a few helpful steps back from specific models and the technical challenges they create in their wake, Stephanie Kirmer gets “a bit philosophical” in her article about the limits of interpretability; attempts to illuminate those black-box models might never achieve full transparency, she argues, but are still important for ML researchers and developers to invest in.

    Photo by Joanna Kosinska on Unsplash

      • Additive Decision Trees
        In his recent work, W Brett Kennedy has been focusing on interpretable predictive models, unpacking their underlying math and showing how they work in practice. His recent deep dive on additive decision trees is a powerful and thorough introduction to such a model, showing how it aims to supplement the limited available options for interpretable classification and regression models.

      • Deep Dive on Accumulated Local Effect Plots (ALEs) with Python
        To round out our selection, we’re thrilled to share Conor O’Sullivan’s hands-on exploration of accumulated local effect plots (ALEs): an older, but dependable method for providing clear interpretations even in the presence of multicollinearity in your model.

    Interested in digging into some other topics this week? From quantization to Pokémon optimization strategies, we’ve got you covered!

      • In a fascinating project walkthrough, Parvathy Krishnan, Joaquim Gromicho, and Kai Kaiser show how they’ve combined several geospatial datasets and some Python to optimize the process of selecting healthcare-facility locations.
      • Learn how weight quantization works and how to apply it in real-world deep learning workflows: Chien Vu’s tutorial is both thorough and accessible.
      • The knapsack problem is a classic optimization challenge; Maria Mouschoutzi, PhD approaches it with a fun new twist, showing how to create the most powerful Pokémon team with the aid of modeling and PuLP, a Python optimization framework.
      • Squeezing the most value out of RAG systems continues to be a top priority for many ML professionals. Leonie Monigatti takes a close look at potential solutions for measuring context relevance.
      • After more than a decade as a data leader at tech giants and high-growth startups, Torsten Walbaum offers the insights he’s accumulated around a fundamental question: how do we make sense of data?
      • Data analysts might not often think of themselves as programmers, but there’s still a lot of room for cross-disciplinary learning, as Mariya Mansurova demonstrates in a data-focused roundup of software-engineering best practices.

    Thank you for supporting the work of our authors! We love publishing articles from new authors, so if you’ve recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don’t hesitate to share it with us.

    Until the next Variable,

    TDS Team

    Sparse Autoencoders, Additive Decision Trees, and Other Emerging Topics in AI Interpretability was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


