Calculate Jaccard Similarity in Python

Jaccard Similarity is commonly used to evaluate how similar two pieces of text are, for example, how similar two tweets are based on their contents. In this tutorial, we will look at what Jaccard Similarity is and how to calculate it in Python. With the help of some examples, we will also look at Jaccard Distance, another commonly used metric.

What is Jaccard Similarity?

[Figure: Jaccard similarity illustration with shapes]

Jaccard Similarity is a measure of how similar two sets are, based on the items present in both sets. It is defined as the ratio of the number of elements common to both sets to the number of elements in the union of the two sets. The following is its formula.

Jaccard similarity between sets A and B:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Jaccard Similarity ranges from 0 to 1. The higher the value, the more similar the two sets are: a value of 1 means the sets are identical, and a value of 0 means they have no elements in common.

Let’s walk through an example. Here are two tweets by Elon Musk. Let’s calculate the Jaccard Similarity between these tweets.

No highs, no lows, only Doge
— Elon Musk (@elonmusk) February 4, 2021

One word: Doge
— Elon Musk (@elonmusk) December 20, 2020

Now, let’s say we apply some preprocessing to the above sentences: convert them to lowercase, remove punctuation, and tokenize each sentence into a set of words. This gives us two sets of words, one per tweet.

According to the formula, we need to determine the number of items in the intersection and in the union of the two sets, and then divide the former by the latter to get the Jaccard Similarity.
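The preprocessing and counting steps described above can be sketched in Python as follows. This is only a minimal sketch: the `preprocess` helper is a hypothetical name introduced for illustration, and it assumes plain whitespace tokenization after stripping ASCII punctuation, with no stemming.

```python
import string

def preprocess(text):
    """Hypothetical helper: lowercase, strip ASCII punctuation, split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

tweet_1 = "No highs, no lows, only Doge"
tweet_2 = "One word: Doge"

set_1 = preprocess(tweet_1)
set_2 = preprocess(tweet_2)

common = set_1 & set_2     # words present in both tweets
combined = set_1 | set_2   # words present in either tweet

print(sorted(set_1))       # ['doge', 'highs', 'lows', 'no', 'only']
print(sorted(set_2))       # ['doge', 'one', 'word']
print(sorted(common))      # ['doge']
print(len(common), len(combined))              # 1 7
print(round(len(common) / len(combined), 2))   # 0.14
```

Note that the duplicate "no" in the first tweet is counted only once, because sets ignore repeated items.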

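Jaccard similarity calculation on the two sentences: the intersection contains 1 word (doge) and the union contains 7 distinct words, so the similarity is 1 / 7 ≈ 0.14.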

You can see that we get the Jaccard Similarity between the two tweets as 0.14.

Jaccard Similarity in Python

Now that we know how Jaccard Similarity is calculated, we can write a custom function in Python to compute the Jaccard Similarity between two lists.

def jaccard_similarity(a, b):
    # convert both inputs to sets so that duplicate items are ignored
    a = set(a)
    b = set(b)
    # calculate Jaccard Similarity: size of intersection / size of union
    j = len(a.intersection(b)) / len(a.union(b))
    return j

Let’s now see the above code in action with the help of some examples.

1. Jaccard Similarity between two lists of strings

Let’s pass two lists of strings to the above function to get the Jaccard Similarity between them.

l1 = ["no", "high", "no", "low", "only", "doge"]
l2 = ["one", "word", "doge"]

jaccard_similarity(l1, l2)

Output:

0.14285714285714285

We get ~0.14 as the output, which is the same result we got from manual calculation above.

2. Jaccard Similarity between two lists of integers

Let’s now pass two lists of integers to the above function.

l1 = [1, 2, 3]
l2 = [1, 2, 4]

jaccard_similarity(l1, l2)

Output:

0.5

We get 0.5 as the output. You can see that {1, 2} are the common elements and {1, 2, 3, 4} is the union. Thus, the Jaccard Similarity comes out to be 0.5.

Note that if there are no common elements between the two sets, the Jaccard Similarity would be zero.
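The zero case, along with one edge case the function above does not cover, can be sketched as follows. The name `jaccard_similarity_safe` and its `empty_value` parameter are hypothetical, and returning 1.0 when both inputs are empty is an assumed convention (the ratio 0/0 is otherwise undefined). Without such a guard, two empty inputs would raise a `ZeroDivisionError`.

```python
def jaccard_similarity_safe(a, b, empty_value=1.0):
    """Jaccard Similarity with a guard for two empty inputs (hypothetical variant)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both inputs are empty: the ratio is undefined, so fall back
        # to an assumed convention instead of dividing by zero.
        return empty_value
    return len(a & b) / len(union)

print(jaccard_similarity_safe([1, 2, 3], [4, 5, 6]))  # 0.0 (no common elements)
print(jaccard_similarity_safe([1, 2, 3], [3, 2, 1]))  # 1.0 (identical sets)
print(jaccard_similarity_safe([], []))                # 1.0 (assumed convention)
```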

3. Jaccard Similarity between two Strings

What would happen if we pass strings instead of lists of strings to the above function? Let’s find out.

# jaccard similarity between two strings
jaccard_similarity("morning", "evening")

Output:

0.375

You can see that we do get a similarity score for the two strings, but what is happening underneath?

If you look at the function, we are creating sets from the two strings. Since iterating over a string yields its characters, these are sets of the individual characters in each string, and they are what the function uses to compute the Jaccard Similarity.

print(set("morning"))
print(set("evening"))

Output:

{'o', 'm', 'n', 'r', 'i', 'g'}
{'n', 'e', 'v', 'i', 'g'}
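The score of 0.375 can be checked by inspecting the character sets directly. The sketch below does that, and also shows one way to compare strings at the word level instead, by splitting them before calling the function. The two example sentences are made up for illustration.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

chars_1 = set("morning")
chars_2 = set("evening")

# sorted() gives a stable order; plain sets print in arbitrary order
print(sorted(chars_1 & chars_2))  # ['g', 'i', 'n']
print(sorted(chars_1 | chars_2))  # ['e', 'g', 'i', 'm', 'n', 'o', 'r', 'v']
print(len(chars_1 & chars_2) / len(chars_1 | chars_2))  # 0.375

# Word-level comparison: split each sentence into words first
s1 = "good morning everyone"
s2 = "good evening everyone"
print(jaccard_similarity(s1.split(), s2.split()))  # 0.5
```

Three of the eight distinct characters are shared, which gives 3 / 8 = 0.375. For the two sentences, two of the four distinct words are shared, which gives 0.5.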

Jaccard Distance

Jaccard Distance is a measure of how dissimilar two sets of values are. It is defined as one minus the Jaccard Similarity, so it also ranges from 0 to 1.

Let’s use the above function we created to calculate the Jaccard Distance between two lists.

l1 = [1, 2, 1]
l2 = [1, 5, 7]

# jaccard distance
d = 1 - jaccard_similarity(l1,l2)
print(d)

Output:

0.75
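If the distance is needed in several places, it may be convenient to wrap it in its own function. The following is a minimal sketch, and the name `jaccard_distance` is an assumption rather than something defined earlier in this tutorial. It also checks an equivalent formulation: the number of elements that belong to exactly one of the two sets (the symmetric difference) divided by the size of the union.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Hypothetical helper: one minus the Jaccard Similarity."""
    return 1 - jaccard_similarity(a, b)

l1 = [1, 2, 1]
l2 = [1, 5, 7]

print(jaccard_distance(l1, l2))                # 0.75
print(jaccard_distance([1, 2, 3], [1, 2, 3]))  # 0.0 (identical sets)
print(jaccard_distance([1, 2], [3, 4]))        # 1.0 (no common elements)

# Equivalent form: size of symmetric difference / size of union
a, b = set(l1), set(l2)
print(len(a ^ b) / len(a | b))                 # 0.75
```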

There are many other measures of distance between two lists of values, for example, Euclidean distance and Manhattan distance.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a Python (version 3.8.3) kernel having pandas version 1.0.5.


Author

Piyush Raj

Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.
