Python – Frequency of each word in String

In this tutorial, we’ll look at how to count the frequency of each word in a string corpus in Python. We’ll also visualize the frequencies with bar charts.

To count the frequency of each word in a string, you’ll first have to tokenize the string into individual words. Then, you can use the collections.Counter class to count each element in the list, resulting in a dictionary of word counts. The following is the syntax:

import collections
s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

Here, s_counts is a dictionary (more precisely, an instance of collections.Counter, which is a subclass of dict) storing word: count mappings based on the frequency of each word in the corpus. You can use it like any dictionary. If you specifically want to convert it into a plain dictionary, use dict(s_counts).
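Continuing the snippet above, the counts can be accessed like any dictionary, with one convenience: a Counter returns 0 for missing keys instead of raising a KeyError.

```python
import collections

s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

# access counts like a regular dictionary
print(s_counts["the"])   # 2
# missing keys return 0 instead of raising KeyError
print(s_counts["bird"])  # 0
# convert to a plain dict if needed
print(dict(s_counts))
```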

Let’s look at an example of extracting the frequency of each word from a string corpus in python.

We use the IMDB movie reviews dataset, which you can download here. The dataset has 50,000 movie reviews submitted by users. We’ll use this dataset to see the most frequent words used by reviewers in positive and negative reviews.

First, we load the data as a pandas dataframe using the read_csv() function.

import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())

Output:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

The dataframe has two columns – “review”, storing the review of the movie, and “sentiment”, storing the sentiment associated with the review. Let’s examine how many samples we have for each sentiment.

print(reviews_df['sentiment'].value_counts())

Output:

positive    25000
negative    25000
Name: sentiment, dtype: int64

We have 25000 samples each for “positive” and “negative” sentiments.

If we look at the entries in the “review” column, we can see that the reviews contain a number of unwanted elements such as HTML tags, punctuation, inconsistent use of lower and upper case, etc. that could hinder our analysis. For example,

print(reviews_df['review'][1])

Output:

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.

You can see that the above review contains HTML tags, quotes, punctuation, etc. that could be cleaned. Let’s write a function to clean the text in the reviews.

import re
import string

def clean_text(text):
    """
    Function to clean the text.
    
    Parameters:
    text: the raw text as a string value that needs to be cleaned
    
    Returns:
    cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuations
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    
    return cleaned_text.strip()

The above function performs the following operations on the text:

  1. Convert the text to lower case
  2. Remove HTML tags from the text using regular expressions.
  3. Remove punctuations from the text using a translation table.

Let’s see the above function in action.

print(clean_text(reviews_df['review'][1]))

Output:

a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done

You can see that the text is now fairly consistent and ready to be split into individual words. Let’s apply this function to the “review” column and create a new column of clean reviews.

reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)

You can use the string split() function to create a list of individual tokens from a string. For example,

print(clean_text(reviews_df['review'][1]).split(" "))

Output:

['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'the', 'actors', 'are', 'extremely', 'well', 'chosen', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'use', 'the', 'traditional', 'dream', 'techniques', 'remains', 'solid', 'then', 'disappears', 'it', 'plays', 'on', 'our', 'knowledge', 'and', 'our', 'senses', 'particularly', 'with', 'the', 'scenes', 'concerning', 'orton', 'and', 'halliwell', 'and', 'the', 'sets', 'particularly', 'of', 'their', 'flat', 'with', 'halliwells', 'murals', 'decorating', 'every', 'surface', 'are', 'terribly', 'well', 'done']

Let’s create a new column with a list of tokenized words for each review.

reviews_df['review_ls'] = reviews_df['clean_review'].apply(lambda x: x.split(" "))
reviews_df.head()

Output:

dataframe of reviews with additional columns for clean text and tokenized list of words

Now that we have tokenized the reviews, we can create lists containing words in all the positive and negative reviews. For this, we’ll use itertools to chain together all the positive and negative reviews in single lists.
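As a quick illustration of what itertools.chain does, here is a toy example (the nested list stands in for the tokenized reviews column) showing how it concatenates the inner lists into one flat sequence:

```python
import itertools

# toy stand-in for a column of tokenized reviews
tokenized_reviews = [["great", "film"], ["bad", "acting"], ["great", "cast"]]
# chain(*iterables) yields the elements of each inner list in order
all_words = list(itertools.chain(*tokenized_reviews))
print(all_words)  # ['great', 'film', 'bad', 'acting', 'great', 'cast']
```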

import itertools

# positive reviews
positive_reviews = reviews_df[reviews_df['sentiment']=='positive']['review_ls']
print("Total positive reviews: ", len(positive_reviews))
positive_reviews_words = list(itertools.chain(*positive_reviews))
print("Total words in positive reviews:", len(positive_reviews_words))

# negative reviews
negative_reviews = reviews_df[reviews_df['sentiment']=='negative']['review_ls']
print("Total negative reviews: ", len(negative_reviews))
negative_reviews_words = list(itertools.chain(*negative_reviews))
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total positive reviews:  25000
Total words in positive reviews: 5721948
Total negative reviews:  25000
Total words in negative reviews: 5631466

Now we have one list each for all the words used in positive reviews and all the words used in negative reviews.

Let’s find the frequency of each word in the positive and the negative corpus. For this, we’ll use collections.Counter, which returns an object that is essentially a dictionary of word-to-frequency mappings.
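As a small illustration before applying it to the review data, Counter’s most_common() method returns the entries with the highest counts:

```python
import collections

words = ["good", "movie", "good", "plot", "good", "movie"]
freq = collections.Counter(words)
# most_common(n) returns the n (word, count) pairs with the highest counts
print(freq.most_common(2))  # [('good', 3), ('movie', 2)]
```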

import collections

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('the', 332496), ('and', 174195), ('a', 162381), ('of', 151419), ('to', 130495), ('is', 111355), ('in', 97366), ('it', 75383), ('i', 68680), ('this', 66846)]
Most common negative words: [('the', 318041), ('a', 156823), ('and', 145139), ('of', 136641), ('to', 135780), ('is', 98688), ('in', 85745), ('this', 78581), ('i', 76770), ('it', 75840)]

You can see that we get just generic words like “the”, “a”, “and”, etc. as the most frequent words. Such words are called “stop words”; they occur frequently in a corpus but do not necessarily offer discriminative information.

Let’s remove these “stop words” and see which words occur more frequently. To remove the stop words we’ll use the nltk library which has a predefined list of stop words for multiple languages.

import nltk
nltk.download("stopwords")

The above code downloads the stopwords from nltk. We can now go ahead and create a list of English stopwords.

from nltk.corpus import stopwords

# list of english stop words
stopwords_ls = list(set(stopwords.words("english")))
print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls[:10])

Output:

Total English stopwords:  179
['some', 'than', 'below', 'once', 'ourselves', "it's", 'these', 'been', 'more', 'which']

We get a list of 179 English stopwords. Note that some of the stopwords contain punctuation. If we are to remove stopwords from our corpus, it makes sense to apply the same preprocessing to the stopwords that we applied to our corpus text.

# cleaning the words in the stopwords list
stopwords_ls = [clean_text(word) for word in stopwords_ls]
print(stopwords_ls[:10])

Output:

['some', 'than', 'below', 'once', 'ourselves', 'its', 'these', 'been', 'more', 'which']

Now, let’s go ahead and remove these words from our positive and negative reviews corpuses using list comprehensions.

# remove stopwords
positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_ls]
print("Total words in positive reviews:", len(positive_reviews_words))
negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_ls]
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total words in positive reviews: 3019338
Total words in negative reviews: 2944033

We can see a significant reduction in the size of the corpuses after removing the stopwords. Now let’s see the most common words in the positive and the negative corpuses.

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('film', 39412), ('movie', 36018), ('one', 25727), ('', 19273), ('like', 17054), ('good', 14342), ('great', 12643), ('story', 12368), ('see', 11864), ('time', 11770)]
Most common negative words: [('movie', 47480), ('film', 35040), ('one', 24632), ('like', 21768), ('', 21677), ('even', 14916), ('good', 14140), ('bad', 14065), ('would', 13633), ('really', 12218)]

You can see that words like “good” and “great” occur frequently in positive reviews while the word “bad” is frequently present in negative reviews. Also, note that a number of words, for example “movie” and “film”, occur commonly in both positive and negative reviews, which is due to the nature of the data itself since these are all movie reviews. You may also notice the empty string '' among the frequent tokens; splitting on a single space produces empty strings wherever the cleaned text contains consecutive spaces (for example, where a punctuation mark surrounded by spaces was removed).
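One way to look past the vocabulary shared by both corpuses is Counter subtraction, which keeps only the positive count differences and so surfaces words relatively more frequent in one corpus than the other. This goes beyond the tutorial’s output; the sketch below uses toy counters standing in for positive_words_frequency and negative_words_frequency.

```python
import collections

# toy counters standing in for the full positive/negative word counters
positive_counts = collections.Counter(["great", "great", "film", "story", "film"])
negative_counts = collections.Counter(["bad", "film", "film", "boring", "bad"])

# subtraction drops words whose counts cancel out (here, "film")
print((positive_counts - negative_counts).most_common())  # [('great', 2), ('story', 1)]
print((negative_counts - positive_counts).most_common())  # [('bad', 2), ('boring', 1)]
```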

We can visualize the above frequencies as charts to better show their counts. Let’s plot a horizontal bar chart of the 10 most frequent words in both the corpuses.

First, let’s create a dataframe each for the top 10 most frequent words in positive and negative corpuses.

positive_freq_words_df = pd.DataFrame(positive_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(positive_freq_words_df)

Output:

    Word  Frequency
0   film      39412
1  movie      36018
2    one      25727
3             19273
4   like      17054
5   good      14342
6  great      12643
7  story      12368
8    see      11864
9   time      11770
negative_freq_words_df = pd.DataFrame(negative_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(negative_freq_words_df)

Output:

     Word  Frequency
0   movie      47480
1    film      35040
2     one      24632
3    like      21768
4              21677
5    even      14916
6    good      14140
7     bad      14065
8   would      13633
9  really      12218

Horizontal bar plot of the most frequent words in the positive reviews:

import matplotlib.pyplot as plt

# set figure size
fig, ax = plt.subplots(figsize=(12, 8))
# plot horizontal bar plot
positive_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in positive corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the positive corpus.

Horizontal bar plot of the most frequent words in the negative reviews:

# set figure size
fig, ax = plt.subplots(figsize=(10, 8))
# plot horizontal bar plot
negative_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in negative corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the negative reviews.

The above was a good exploratory analysis to see the most frequent words used in the IMDB movie reviews dataset for positive and negative reviews. As a next step, you can go ahead and train your own sentiment analysis model to take in a movie review and predict whether it’s positive or negative.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were implemented in a Jupyter Notebook with a Python 3.8.3 kernel and pandas version 1.0.5.

