Similarity Measures for Data Science

Introduction

Understanding whether two entities are alike is a core data science and business question. We may be comparing customer segments, predicting customer behavior based on similar customers in the past, or recommending new products to customers.

All of these questions require a similarity measure for comparing two entities. While the choice may seem inconsequential, results can vary significantly depending on which measure you use.

Similarity metrics are used in many machine learning algorithms such as k-means clustering, k-nearest neighbors, and recommendation engines.

In this post, I will give a brief overview of how each similarity measure is calculated and then discuss when it is most appropriate. I will cover four common similarity measures:

  1. Euclidean
  2. Manhattan
  3. Cosine
  4. Jaccard

Note: I will use the terms similarity measure/metric and distance measure/metric throughout this post. They are two sides of the same coin: similarity measures give higher scores when two objects are more similar, while distance measures give lower scores when two objects are more similar (i.e. closer).


Basics

As with all data science analyses, we approach this question with a mathematical framework. If we can describe the properties of an entity with a list of numbers, we can calculate how similar two entities are.

What do I mean? Suppose our entities are vehicles, each described by two properties: number of wheels and horsepower. Vehicle A has 4 wheels and 250 horsepower, vehicle B has 4 wheels and 225 horsepower, and vehicle C has 2 wheels and 100 horsepower. Intuitively, vehicles A and B are the most similar, vehicles B and C are the second most similar, and vehicles A and C are the least similar.

Vehicle | Number of Wheels | Horsepower
A       | 4                | 250
B       | 4                | 225
C       | 2                | 100


Similarity metrics allow us to formalize and quantify the subjective process of discerning similarity.


Euclidean Distance

Euclidean distance is the standard distance metric we all learned in school. It comes from the Pythagorean theorem, which gives a formula for the length of the hypotenuse of a right triangle: for two points, the distance is the square root of the sum of the squared differences along each dimension. In general, when people talk about how far point A is from point B, they are using Euclidean distance.

(Image source: https://hlab.stanford.edu/brian/making_measurements.html)

Euclidean distance is more sensitive to magnitude than other distance metrics. Because the differences along each dimension are squared, features (aka properties) with larger magnitudes will have more influence over the results. Consider three data points: (2, 1000), (5, 1500), and (2, 1750). Euclidean distance says points one and two are closer than points one and three. If you look at the formula you'll note that even significant changes in the first feature, such as doubling or tripling its value, will not have much effect on the outcome.
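To see this concretely, here is a minimal Python sketch of the calculation using the example points above (the helper function is my own, not from any library):

    import math

    def euclidean(a, b):
        # Square root of the sum of squared differences along each dimension.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    p1, p2, p3 = (2, 1000), (5, 1500), (2, 1750)

    print(euclidean(p1, p2))         # ~500.0 -> points one and two are closer
    print(euclidean(p1, p3))         # 750.0
    print(euclidean((6, 1000), p2))  # ~500.0 -> tripling the first feature barely matters

The second feature's differences are in the hundreds, so after squaring they completely dominate the single-digit differences in the first feature.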

Exaggerating the effect of certain features may or may not be desirable. If our rows are customers, our features are products, and the value in each cell is total sales, maybe we are happy to leave the values as is because we care more about products with higher sales totals. However, we may not care how much of a product a customer purchases, just whether or not they purchased it at all.

In order to mitigate the effects of varying magnitudes across features, you perform scaling, usually normalization or standardization. The details of scaling are beyond the scope of this post, but scaling "squashes" the values and prevents features from having a greater influence over the result simply because they contain larger numbers. Oftentimes, features only have larger numbers because of arbitrary decisions about how they are measured. For example, if we are looking at individuals' age and income, we could easily measure income in thousands of dollars instead of dollars.
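As a rough illustration (the ages and incomes below are made up), standardization can be done in a couple of lines:

    import numpy as np

    # Each row is a person: [age in years, income in dollars].
    X = np.array([[25, 40_000], [40, 85_000], [60, 52_000]], dtype=float)

    # Standardize: subtract each column's mean and divide by its standard deviation.
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

After scaling, age and income contribute on comparable scales, so income no longer dominates distance calculations just because it happens to be measured in dollars.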

Overall, Euclidean distance is easy to explain; it is sensitive to outliers and large-magnitude features, but it can be made more robust with scaling. If you are happy with the scale of each feature, then Euclidean is a good metric to use.


Manhattan Distance

Manhattan distance is less well known than Euclidean, but it is even easier to calculate and understand. It is named after the grid of streets in Manhattan: to get from point A to point B you cannot cut diagonals, you can only walk east-west and north-south. By the same logic, Manhattan distance adds up the distance required to move along each dimension to get from A to B.

(Image: Euclidean distance in green; Manhattan distance in red, blue, and yellow, all the same length. Source: https://en.wikipedia.org/wiki/Taxicab_geometry#/media/File:Manhattan_distance.svg)

Manhattan distance is less sensitive to magnitude than Euclidean. Consider the three data points: (1,1), (3,3), and (1,5). Using Euclidean distance, points one and two are closer than points one and three. Using Manhattan distance, points one and two are the same distance apart as points one and three. The effect of the 5 in point three is "softened."
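A quick sketch comparing the two metrics on these points (the euclidean helper mirrors the one from the Euclidean section):

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        # Sum of the absolute differences along each dimension.
        return sum(abs(x - y) for x, y in zip(a, b))

    p1, p2, p3 = (1, 1), (3, 3), (1, 5)

    print(euclidean(p1, p2), euclidean(p1, p3))  # ~2.83 vs 4.0 -> one and two are closer
    print(manhattan(p1, p2), manhattan(p1, p3))  # 4 vs 4       -> the same distance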

As with Euclidean distance, scaling may or may not be desirable. For example, if we are using several features of sales data, we do not want some features measured in the $1,000s and others in the $1s.

Although less well known than Euclidean distance, Manhattan is often a solid, if not better, choice thanks to its ease of explanation and its greater robustness to outliers.


Cosine Similarity

Cosine Similarity is a popular similarity metric because it is robust to magnitude differences in the data. It is good for numeric and binary data.

Cosine similarity is calculated by measuring the angle between two vectors (in our case, data points) and taking the cosine of that angle; in practice this works out to the dot product of the two vectors divided by the product of their lengths. If the angle is zero, the cosine is one, which is the most similar two data points can be.

(Image source: https://engineering.aweber.com/wp-content/uploads/2013/02/4AUbj.png)

Cosine similarity is good when you care more about the ratio of features to one another than about feature magnitude. For example, let's consider customer buying patterns of shirts and socks. We have three customers with the following feature vectors: (5,5), (20,20), and (10,20). Most humans would agree that customers one and two exhibit the most similar buying trends because each has a 1:1 ratio of shirts to socks. However, Euclidean distance finds that customers one and three are more similar. Cosine similarity gets the right answer here; customers one and two have feature vectors pointing in the same direction and are therefore most similar.
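Here is a small sketch of the calculation on the shirt/sock example (the cosine_similarity helper is hand-rolled for clarity; libraries such as scikit-learn offer an equivalent):

    import math

    def cosine_similarity(a, b):
        # Dot product divided by the product of the vector lengths.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    c1, c2, c3 = (5, 5), (20, 20), (10, 20)  # (shirts, socks) per customer

    print(cosine_similarity(c1, c2))  # 1.0   -> same direction, maximally similar
    print(cosine_similarity(c1, c3))  # ~0.95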


Jaccard Similarity

Jaccard similarity and its counterpart, the Jaccard distance, are less well known metrics but are simple to calculate and give solid results. Jaccard similarity is useful when we have binary features (i.e. yes or no variables).

To calculate the Jaccard similarity, divide the size of the intersection of the two sets by the size of their union. In the binary feature case, we count the features where both vectors have a one (the intersection) and divide by the number of features where at least one vector has a one (the union).

For example, if we have two vectors (1,0,0) and (1,1,0), they have a Jaccard similarity of 1/2. The numerator is one because only the first feature is one in both vectors. The denominator is two because at least one vector has a one in only the first two features; the last feature is zero in both, so it counts toward neither the intersection nor the union.
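The same calculation as a short Python sketch (the jaccard helper is written out by hand for clarity):

    def jaccard(a, b):
        # a and b are equal-length binary (0/1) feature vectors.
        intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
        return intersection / union

    print(jaccard((1, 0, 0), (1, 1, 0)))  # 0.5 -> one shared feature out of two present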

Note, it might seem strange that we do not count the case where both vectors have a zero in the same feature as part of the intersection. Consider the example where the feature is US Citizen, yes or no: two US citizens are much more alike than two non-US citizens, since the non-citizens are likely from two very different countries. The commonality of absence implies the absence of commonality.

Jaccard similarity is strong when we do not care about how much, only yes or no. Business applications include purchase history and viewing history. For example, we can make each feature a yes/no variable for the purchase of a specific product. If you are worried this might group together individuals who bought $1 and $1,000 of the same product, you can set a cutoff threshold for each feature and only mark values above the threshold as a yes (see the sketch below).
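As a rough sketch (the purchase totals and cutoff below are invented for illustration), the thresholding step might look like:

    # Hypothetical purchase totals per product for one customer.
    totals = [1.0, 0.0, 1000.0, 15.0]

    THRESHOLD = 10.0  # arbitrary cutoff chosen for illustration

    # Convert dollar amounts to yes/no features before computing Jaccard similarity.
    binary = [1 if t >= THRESHOLD else 0 for t in totals]
    print(binary)  # [0, 0, 1, 1]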

Conclusion

Your choice of distance metric can lead to vastly different results, so choose it carefully. Understanding the consequences of your chosen metric is crucial to making sure your analysis answers the right questions. Should the number of pairs of socks bought be a more important factor in decision making than the number of computers bought, just because people buy dozens more socks than computers? Is it more important that customers always buy a one-to-one ratio of mice to computers, or that the overall quantities of mice and computers are similar between customers? Be active in the process of choosing a distance metric.