Jaccard Similarity is commonly used to evaluate how similar two pieces of text are, for example, how similar two tweets are based on their contents. In this tutorial, we will look at what Jaccard Similarity is and how to calculate it in Python. With the help of some examples, we will also look at Jaccard Distance, another commonly used metric.
What is Jaccard Similarity?
Jaccard Similarity is a measure of how similar two sets are based on the items they share. It is defined as the number of elements common to both sets divided by the number of elements in their union. For two sets A and B, the formula is J(A, B) = |A ∩ B| / |A ∪ B|.
Jaccard Similarity ranges from 0 to 1. A value of 0 means the two sets have no elements in common, a value of 1 means they are identical, and the higher the value, the more similar the two sets are.
Let’s walk through an example. Here are two tweets by Elon Musk. Let’s calculate the Jaccard Similarity between these tweets.
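Now, let’s say we apply some preprocessing to the above sentences: convert everything to lowercase, remove punctuation, and tokenize each sentence into a set of words. We get the following two sets (the same words appear in the lists used in the Python example later in this tutorial).
{"no", "high", "low", "only", "doge"}
{"one", "word", "doge"}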
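The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact preprocessing used for the tweets: the function name `tokenize` and the sample sentence are assumptions made for this example, and `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would need extra handling.

```python
import string

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and return the set of words."""
    # a translation table with a deletion list removes every punctuation character
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any run of whitespace
    return set(cleaned.split())

# illustrative sentence (not one of the tweets from this tutorial)
print(tokenize("Data science is fun, really fun!"))
```

Because the result is a set, repeated words such as "fun" are kept only once, which is what the Jaccard formula expects.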
According to the formula, we need to determine the number of items in the intersection and the union of the two sets and divide the two to get the Jaccard Similarity.
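The intersection and union can be checked step by step with Python's built-in set operators. The sketch below assumes the two token sets are the ones formed from the lists used in the code example later in this tutorial; the variable names are illustrative.

```python
# token sets built from the lists used in the code example later in this tutorial
tweet_1 = {"no", "high", "low", "only", "doge"}
tweet_2 = {"one", "word", "doge"}

common = tweet_1 & tweet_2      # intersection: words present in both sets
combined = tweet_1 | tweet_2    # union: words present in either set

print(len(common))                             # 1
print(len(combined))                           # 7
print(round(len(common) / len(combined), 2))   # 0.14
```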
The two sets share one word ("doge") out of seven distinct words in total, so the Jaccard Similarity between the two tweets is 1/7 ≈ 0.14.
Jaccard Similarity in Python
Now that we know how Jaccard Similarity is calculated, we can write a custom function in Python to compute the Jaccard Similarity between two lists.
def jaccard_similarity(a, b):
    # convert both inputs to sets (duplicate items are dropped)
    a = set(a)
    b = set(b)
    # calculate Jaccard Similarity: size of intersection / size of union
    j = len(a.intersection(b)) / len(a.union(b))
    return j
Let’s now see the above code in action with the help of some examples.
1. Jaccard Similarity between two lists of strings
Let’s pass two lists of strings to the above function to get the Jaccard Similarity between them.
l1 = ["no", "high", "no", "low", "only", "doge"]
l2 = ["one", "word", "doge"]
jaccard_similarity(l1, l2)
Output:
0.14285714285714285
We get ~0.14 as the output, which is the same result we got from manual calculation above.
2. Jaccard Similarity between two lists of integers
Let’s now pass two lists of integers to the above function.
l1 = [1, 2, 3]
l2 = [1, 2, 4]
jaccard_similarity(l1, l2)
Output:
0.5
We get 0.5 as the output. You can see that {1, 2} are the common elements and {1, 2, 3, 4} is the union. Thus, the Jaccard Similarity comes out to be 0.5.
Note that if there are no common elements between the two sets, the Jaccard Similarity would be zero.
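One edge case is worth keeping in mind: if both inputs are empty, the union is empty too, and the function above raises a `ZeroDivisionError`. The variant below is a sketch of one way to handle this. The name `jaccard_similarity_safe` and the decision to return 1.0 for two empty inputs are assumptions made for illustration; returning 0.0 or raising an error are equally defensible choices.

```python
def jaccard_similarity_safe(a, b):
    """Jaccard Similarity that returns 1.0 when both inputs are empty.

    The name and the choice of 1.0 are illustrative assumptions.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # both sets are empty: avoid dividing by zero
        return 1.0
    return len(a & b) / len(union)

print(jaccard_similarity_safe([1, 2, 3], [4, 5, 6]))  # 0.0
print(jaccard_similarity_safe([], []))                # 1.0
```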
3. Jaccard Similarity between two Strings
What would happen if we pass strings instead of lists of strings to the above function? Let’s find out.
# jaccard similarity between two strings
jaccard_similarity("morning", "evening")
Output:
0.375
You can see that we do get a similarity score for the two strings, but what is happening underneath?
If you look at the function, it creates a set from each input. For a string, that is the set of its individual characters, and these character sets are then used to compute the Jaccard Similarity. Here the two strings share three characters (n, i and g) out of eight distinct characters in total, which gives 3/8 = 0.375.
print(set("morning"))
print(set("evening"))
Output:
{'o', 'm', 'n', 'r', 'i', 'g'}
{'n', 'e', 'v', 'i', 'g'}
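If the goal is to compare sentences by their words rather than by their characters, one option is to split each string into words before passing it to the function. The sketch below restates the function in compact form so that it runs on its own; the sample sentences are made up for illustration.

```python
def jaccard_similarity(a, b):
    # same logic as the function defined earlier, written compactly
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s1 = "good morning everyone"
s2 = "good evening everyone"

# character-level: each string is treated as a set of characters
print(jaccard_similarity(s1, s2))

# word-level: split on whitespace first, then compare sets of words
print(jaccard_similarity(s1.split(), s2.split()))  # 0.5
```

At the word level, the two sentences share two of four distinct words. The character-level score is higher because most of the letters overlap, which is usually not what you want when comparing sentences.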
Jaccard Distance
Jaccard Distance is a measure of how dissimilar two sets of values are. It is defined as one minus the Jaccard Similarity, that is, d(A, B) = 1 - J(A, B), so it also ranges from 0 to 1.
Let’s use the above function we created to calculate the Jaccard Distance between two lists.
l1 = [1, 2, 1]
l2 = [1, 5, 7]
# jaccard distance
d = 1 - jaccard_similarity(l1, l2)
print(d)
Output:
0.75
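If you need the distance often, it can be convenient to wrap the calculation in its own function. The helper below is a hypothetical sketch (the name `jaccard_distance` is an assumption, not something defined elsewhere in this tutorial) and, like the similarity function, it assumes at least one of the inputs is non-empty.

```python
def jaccard_distance(a, b):
    """Jaccard Distance: 1 minus the Jaccard Similarity (hypothetical helper)."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance([1, 2, 1], [1, 5, 7]))  # 0.75
print(jaccard_distance([1, 2], [2, 1]))        # 0.0
print(jaccard_distance([1, 2], [3, 4]))        # 1.0
```

Identical sets give a distance of 0 and sets with nothing in common give a distance of 1, mirroring the behaviour of the similarity score.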
There are many other measures of distances between two lists of values. For example, Euclidean distance, Manhattan distance, etc.
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.