Calculate Jaccard Similarity in Python

Jaccard Similarity is commonly used to evaluate how similar two pieces of text are, for example, how similar two tweets are based on their contents. In this tutorial, we will look at what Jaccard Similarity is and how to calculate it in Python. With the help of some examples, we will also look at Jaccard Distance, a closely related and commonly used metric.

Jaccard similarity illustration with shapes

Jaccard Similarity is a measure of how similar two sets are, based on the items present in both sets. It is defined as the number of elements common to the two sets divided by the number of elements in their union. The following is its formula.

Jaccard similarity between sets A and B: J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard Similarity ranges from 0 to 1. A value of 0 means the two sets share no elements, and a value of 1 means the two sets are identical.
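As a quick illustration of the definition, the ratio can be computed directly with Python's built-in set operators. The two sets below are made up for this sketch and are not part of the tutorial's examples.

```python
# Two small example sets (made up for illustration).
A = {"a", "b", "c"}
B = {"b", "c", "d"}

common = A & B      # intersection: elements present in both sets
combined = A | B    # union: elements present in either set

similarity = len(common) / len(combined)
print(similarity)  # 0.5, since 2 shared elements / 4 distinct elements
```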

Let’s walk through an example. Here are two tweets by Elon Musk. Let’s calculate the Jaccard Similarity between these tweets.

Now, let’s say we apply some preprocessing to the above sentences: convert them to lowercase, remove punctuation, and tokenize each sentence into a set of words. We get the following two sets.
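The preprocessing steps described above could be sketched as follows. The helper name `to_word_set` and the sample sentence are assumptions made for illustration. This is one simple way to do it, not necessarily the exact preprocessing used for the tweets.

```python
import string

def to_word_set(text):
    """Lowercase the text, strip ASCII punctuation, and split it into a set of words."""
    lowered = text.lower()
    # A translation table with a third argument deletes every listed character.
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any run of whitespace.
    return set(cleaned.split())

# sorted() is used only to get a stable print order; sets are unordered.
print(sorted(to_word_set("Hello, world! Hello again.")))  # ['again', 'hello', 'world']
```

Note that `string.punctuation` covers only ASCII punctuation, so characters such as curly quotes would need separate handling.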

Sentences as sets of words: A = {no, high, low, only, doge} and B = {one, word, doge}.

According to the formula, we need to determine the number of items in the intersection and the union of the two sets and divide the two to get the Jaccard Similarity.

Jaccard similarity calculation on the two sentences: the sets share one word (doge) and their union contains seven distinct words, so J = 1/7 ≈ 0.14.

The Jaccard Similarity between the two tweets comes out to approximately 0.14.
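The manual calculation can be reproduced step by step in Python. The sketch below assumes the two word sets are the ones given by the example word lists used in this tutorial, and the variable names are chosen here for illustration.

```python
# Word sets for the two tweets after preprocessing.
a = {"no", "high", "low", "only", "doge"}
b = {"one", "word", "doge"}

common = a & b      # words that appear in both tweets
combined = a | b    # all distinct words across both tweets

print(sorted(common))                          # ['doge']
print(len(combined))                           # 7
print(round(len(common) / len(combined), 2))   # 0.14
```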

Now that we know how Jaccard Similarity is calculated, we can write a custom function in Python to compute the Jaccard Similarity between two lists.

def jaccard_similarity(a, b):
    # convert both inputs to sets so that duplicates are ignored
    a = set(a)
    b = set(b)
    # calculate Jaccard Similarity: size of intersection / size of union
    j = len(a.intersection(b)) / len(a.union(b))
    return j
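One edge case worth noting: if both inputs are empty, the union is empty and the division raises a `ZeroDivisionError`. A minimal sketch of a guarded variant is shown below. The name `jaccard_similarity_safe` is hypothetical, and returning 1.0 for two empty inputs is a convention assumed here. Depending on the application, returning 0.0 or raising an error may be more appropriate.

```python
def jaccard_similarity_safe(a, b):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty collections are treated as identical.
        return 1.0
    return len(a & b) / len(union)

print(jaccard_similarity_safe([], []))                # 1.0
print(jaccard_similarity_safe([1, 2, 3], [1, 2, 4]))  # 0.5
```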

Let’s now see the above code in action with the help of some examples.

Let’s pass two lists of strings to the above function to get the Jaccard Similarity between them.

l1 = ["no", "high", "no", "low", "only", "doge"]
l2 = ["one", "word", "doge"]

jaccard_similarity(l1, l2)

Output:

0.14285714285714285

We get ~0.14 as the output, which is the same result we got from the manual calculation above.

Let’s now pass two lists of integers to the above function.

l1 = [1, 2, 3]
l2 = [1, 2, 4]

jaccard_similarity(l1, l2)

Output:

0.5

We get 0.5 as the output. You can see that {1, 2} are the common elements and {1, 2, 3, 4} is the union. Thus, the Jaccard Similarity comes out to be 2/4 = 0.5.

Note that if there are no common elements between the two sets, the Jaccard Similarity would be zero.
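The two boundary cases can be checked quickly. The block below redefines the function in a compact form so that it runs on its own, and the input lists are made up for illustration.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# No common elements: the intersection is empty, so the similarity is 0.
print(jaccard_similarity([1, 2, 3], [4, 5, 6]))  # 0.0

# Same elements in a different order: the sets are identical, so the similarity is 1.
print(jaccard_similarity([1, 2, 3], [3, 2, 1]))  # 1.0
```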

What would happen if we pass strings instead of lists of strings to the above function? Let’s find out.

# jaccard similarity between two strings
jaccard_similarity("morning", "evening")

Output:

0.375

You can see that we do get a similarity score for the two strings, but what is happening underneath?

If you look at the function, we are creating sets from the two strings. Since iterating over a string yields its characters, these are sets of the individual characters in each string, and those character sets are then used to compute the Jaccard Similarity.

print(set("morning"))
print(set("evening"))

Output:

{'o', 'm', 'n', 'r', 'i', 'g'}
{'n', 'e', 'v', 'i', 'g'}
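The two character sets share three characters (g, i, n) out of eight distinct characters overall, which gives 3/8 = 0.375. If a word-level comparison is intended instead, one option is to split each string into words before calling the function. The sketch below redefines the function compactly so that it runs on its own, and the two sentences are made up for illustration.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Characters shared by the two strings from the example above.
print(sorted(set("morning") & set("evening")))  # ['g', 'i', 'n']

s1 = "good morning everyone"
s2 = "good evening everyone"

# Passing the raw strings compares sets of characters (including the space).
char_level = jaccard_similarity(s1, s2)

# Splitting first compares sets of words.
word_level = jaccard_similarity(s1.split(), s2.split())
print(word_level)  # 0.5
```

Here the character-level score is higher than the word-level score, because the two sentences use nearly the same characters even though only two of their three words match.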

Jaccard Distance is a measure of how dissimilar two sets of values are. It is defined as one minus the Jaccard Similarity.

Jaccard distance formula: D(A, B) = 1 - J(A, B)

Let’s use the function we created above to calculate the Jaccard Distance between two lists.

l1 = [1, 2, 1]
l2 = [1, 5, 7]

# jaccard distance
d = 1 - jaccard_similarity(l1, l2)
print(d)

Output:

0.75
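If the distance is needed often, it may be convenient to wrap the calculation in its own function. The name `jaccard_distance` below is a choice made for this sketch, and the block is self-contained rather than reusing the earlier function.

```python
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    # One minus the Jaccard Similarity of the two sets.
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance([1, 2, 1], [1, 5, 7]))  # 0.75
print(jaccard_distance([1, 2, 3], [1, 2, 3]))  # 0.0
```

In the first call the sets are {1, 2} and {1, 5, 7}, which share one element out of four distinct elements, so the similarity is 0.25 and the distance is 0.75.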

There are many other distance measures for two lists of values, for example, Euclidean distance and Manhattan distance.
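For comparison, here is a brief sketch of those two distances using only the standard library. The points `p` and `q` are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

p = [1, 2, 3]
q = [4, 6, 3]

# Euclidean distance: square root of the sum of squared differences.
euclidean = math.dist(p, q)

# Manhattan distance: sum of absolute differences.
manhattan = sum(abs(x - y) for x, y in zip(p, q))

print(euclidean)  # 5.0
print(manhattan)  # 7
```

Unlike Jaccard Distance, these measures compare numeric values position by position, so the order of the elements matters.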

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel having pandas version 1.0.5.

