In this tutorial, we’ll look at how to count the frequency of each word in a string corpus in Python. We’ll also compare the frequencies with visualizations like bar charts.
Count of each word in a string
To count the frequency of each word in a string, you’ll first have to tokenize the string into individual words. Then, you can use the collections.Counter class to count each element in the resulting list, giving you a dictionary of word counts. The following is the syntax:
```python
import collections

s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))
```
Here, s_counts is a dictionary (more precisely, an object of collections.Counter, which is a subclass of dict) storing the word: count mapping based on the frequency of each word in the corpus. You can use it for all dictionary-like operations. But if you specifically want to convert it into a plain dictionary, use dict(s_counts).
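For instance, here is a minimal sketch of the Counter’s dictionary-like behavior, continuing with the s_counts object from above:

```python
# index it like a dictionary to get a single word's count
print(s_counts["the"])          # 2

# Counter also adds helper methods such as most_common()
print(s_counts.most_common(2))  # [('the', 2), ('cat', 1)]

# convert to a plain dictionary if needed
print(dict(s_counts))
```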
Let’s look at an example of extracting the frequency of each word from a string corpus in Python.
Count of each word in Movie Reviews dataset
We use the IMDB movie reviews dataset, which you can download here. The dataset has 50,000 movie reviews written by users. We’ll be using this dataset to see the most frequent words used by the reviewers in positive and negative reviews.
1 – Load the data
First we load the data as a pandas dataframe using the read_csv() function.
```python
import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())
```
Output:
```
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
```
The dataframe has two columns – “review”, storing the review of the movie, and “sentiment”, storing the sentiment associated with the review. Let’s examine how many samples we have for each sentiment.
```python
print(reviews_df['sentiment'].value_counts())
```
Output:
```
positive    25000
negative    25000
Name: sentiment, dtype: int64
```
We have 25000 samples each for “positive” and “negative” sentiments.
2 – Cleaning the text
If we look at the entries in the “review” column, we find that the reviews contain a number of unwanted elements such as HTML tags, punctuation, inconsistent use of lower and upper case, etc. that could hinder our analysis. For example,
```python
print(reviews_df['review'][1])
```
Output:
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.
You can see that the above review contains HTML tags, quotes, punctuation, etc. that should be cleaned. Let’s write a function to clean the text in the reviews.
```python
import re
import string

def clean_text(text):
    """
    Function to clean the text.

    Parameters:
    text: the raw text as a string value that needs to be cleaned

    Returns:
    cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuation
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    return cleaned_text.strip()
```
The above function performs the following operations on the text:
- Converts the text to lower case.
- Removes HTML tags from the text using a regular expression.
- Removes punctuation from the text using a translation table (see the short sketch after this list).
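If the translation table step looks opaque, here is a minimal standalone sketch of what str.maketrans() and translate() do (the sample string below is purely illustrative):

```python
import string

# build a table that maps every punctuation character to None (i.e., deletes it)
table = str.maketrans('', '', string.punctuation)

print("it's a well-made, old-school film!".translate(table))
# its a wellmade oldschool film
```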
Let’s see the above function in action.
```python
print(clean_text(reviews_df['review'][1]))
```
Output:
a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done
You can see that the text is now fairly consistent and ready to be split into individual words. Let’s apply this function to the “review” column and create a new column of clean reviews.
```python
reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)
```
3 – Tokenize the text into words
You can use the string split() function to create a list of individual tokens from a string. For example,
```python
print(clean_text(reviews_df['review'][1]).split(" "))
```
Output:
```
['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'the', 'actors', 'are', 'extremely', 'well', 'chosen', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'use', 'the', 'traditional', 'dream', 'techniques', 'remains', 'solid', 'then', 'disappears', 'it', 'plays', 'on', 'our', 'knowledge', 'and', 'our', 'senses', 'particularly', 'with', 'the', 'scenes', 'concerning', 'orton', 'and', 'halliwell', 'and', 'the', 'sets', 'particularly', 'of', 'their', 'flat', 'with', 'halliwells', 'murals', 'decorating', 'every', 'surface', 'are', 'terribly', 'well', 'done']
```
Let’s create a new column with a list of tokenized words for each review.
```python
reviews_df['review_ls'] = reviews_df['clean_review'].apply(lambda x: x.split(" "))
reviews_df.head()
```
Output: (the first five rows of the dataframe, now including the clean_review and review_ls columns)
4 – Create a corpus for positive and negative reviews
Now that we have tokenized the reviews, we can create lists containing the words in all the positive and negative reviews. For this, we’ll use itertools to chain all the positive and, separately, all the negative reviews together into single lists.
```python
import itertools

# positive reviews
positive_reviews = reviews_df[reviews_df['sentiment']=='positive']['review_ls']
print("Total positive reviews: ", len(positive_reviews))
positive_reviews_words = list(itertools.chain(*positive_reviews))
print("Total words in positive reviews:", len(positive_reviews_words))

# negative reviews
negative_reviews = reviews_df[reviews_df['sentiment']=='negative']['review_ls']
print("Total negative reviews: ", len(negative_reviews))
negative_reviews_words = list(itertools.chain(*negative_reviews))
print("Total words in negative reviews:", len(negative_reviews_words))
```
Output:
```
Total positive reviews:  25000
Total words in positive reviews: 5721948
Total negative reviews:  25000
Total words in negative reviews: 5631466
```
Now we have one list each for all the words used in positive reviews and all the words used in negative reviews.
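As a side note, itertools.chain(*positive_reviews) unpacks every review’s word list as a separate argument to chain(). The equivalent itertools.chain.from_iterable() avoids that unpacking and is generally preferred for large collections; here is a minimal sketch with toy data:

```python
import itertools

tokenized = [['a', 'wonderful', 'production'], ['a', 'family']]

# equivalent to list(itertools.chain(*tokenized)), without argument unpacking
words = list(itertools.chain.from_iterable(tokenized))
print(words)  # ['a', 'wonderful', 'production', 'a', 'family']
```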
5 – Estimate the word frequency in the corpus
Let’s find the frequency of each word in the positive and the negative corpus. For this, we’ll use collections.Counter, which returns an object that is essentially a dictionary of word to frequency mappings.
```python
import collections

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))
```
Output:
```
Most common positive words: [('the', 332496), ('and', 174195), ('a', 162381), ('of', 151419), ('to', 130495), ('is', 111355), ('in', 97366), ('it', 75383), ('i', 68680), ('this', 66846)]
Most common negative words: [('the', 318041), ('a', 156823), ('and', 145139), ('of', 136641), ('to', 135780), ('is', 98688), ('in', 85745), ('this', 78581), ('i', 76770), ('it', 75840)]
```
You can see that we get just generic words like “the”, “a”, “and”, etc. as the most frequent words. Such words are called “stop words”; they occur frequently in a corpus but do not necessarily offer discriminative information.
Let’s remove these stop words and see which words occur more frequently. To remove the stop words, we’ll use the nltk library, which has a predefined list of stop words for multiple languages.
```python
import nltk
nltk.download("stopwords")
```
The above code downloads the stopwords from nltk. We can now go ahead and create a list of English stopwords.
```python
from nltk.corpus import stopwords

# list of english stop words
stopwords_ls = list(set(stopwords.words("english")))
print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls[:10])
```
Output:
```
Total English stopwords:  179
['some', 'than', 'below', 'once', 'ourselves', "it's", 'these', 'been', 'more', 'which']
```
We get a list of 179 English stopwords. Note that some of the stopwords contain punctuation (for example, “it’s”). Since we are removing stopwords from our corpus, it makes sense to apply the same preprocessing to the stopwords that we applied to the corpus text.
```python
# cleaning the words in the stopwords list
stopwords_ls = [clean_text(word) for word in stopwords_ls]
print(stopwords_ls[:10])
```
Output:
```
['some', 'than', 'below', 'once', 'ourselves', 'its', 'these', 'been', 'more', 'which']
```
Now, let’s go ahead and remove these words from our positive and negative review corpora using list comprehensions.
```python
# remove stopwords
positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_ls]
print("Total words in positive reviews:", len(positive_reviews_words))
negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_ls]
print("Total words in negative reviews:", len(negative_reviews_words))
```
Output:
```
Total words in positive reviews: 3019338
Total words in negative reviews: 2944033
```
We can see a significant reduction in the size of the corpora after removing the stopwords.
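One practical aside: the check word not in stopwords_ls scans the entire list for every single word, so the comprehensions above do on the order of len(words) × len(stopwords) comparisons. Converting the stopwords to a set gives constant-time membership tests on average and produces identical results; a minimal sketch:

```python
# a set supports O(1) average-case membership tests, unlike a list
stopwords_set = set(stopwords_ls)

positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_set]
negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_set]
```

Now let’s see the most common words in the positive and the negative corpora.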
```python
positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))
```
Output:
```
Most common positive words: [('film', 39412), ('movie', 36018), ('one', 25727), ('', 19273), ('like', 17054), ('good', 14342), ('great', 12643), ('story', 12368), ('see', 11864), ('time', 11770)]
Most common negative words: [('movie', 47480), ('film', 35040), ('one', 24632), ('like', 21768), ('', 21677), ('even', 14916), ('good', 14140), ('bad', 14065), ('would', 13633), ('really', 12218)]
```
You can see that words like “good” and “great” occur frequently in positive reviews, while the word “bad” is frequently present in negative reviews. Also, note that a number of words, for example “movie” and “film”, occur commonly in both positive and negative reviews, which is due to the nature of the data itself, since these are movie reviews. You may also notice the empty string '' among the most frequent tokens; it appears because splitting on a single space produces empty strings wherever the cleaned text contains consecutive spaces.
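Here is a quick illustration of where those empty tokens come from, and how tokenizing with str.split() without an argument would avoid them (the sample string is purely illustrative):

```python
text = "too  many   spaces"

# splitting on a single space yields an empty string for each extra space
print(text.split(" "))  # ['too', '', 'many', '', '', 'spaces']

# split() with no argument splits on runs of whitespace and drops empty tokens
print(text.split())     # ['too', 'many', 'spaces']
```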
6 – Visualize the word counts
We can visualize the above frequencies as charts to better show their counts. Let’s plot a horizontal bar chart of the 10 most frequent words in both corpora.
First, let’s create a dataframe each for the top 10 most frequent words in the positive and negative corpora.
```python
positive_freq_words_df = pd.DataFrame(positive_words_frequency.most_common(10),
                                      columns=["Word", "Frequency"])
print(positive_freq_words_df)
```
Output:
```
    Word  Frequency
0   film      39412
1  movie      36018
2    one      25727
3             19273
4   like      17054
5   good      14342
6  great      12643
7  story      12368
8    see      11864
9   time      11770
```
```python
negative_freq_words_df = pd.DataFrame(negative_words_frequency.most_common(10),
                                      columns=["Word", "Frequency"])
print(negative_freq_words_df)
```
Output:
```
     Word  Frequency
0   movie      47480
1    film      35040
2     one      24632
3    like      21768
4              21677
5    even      14916
6    good      14140
7     bad      14065
8   would      13633
9  really      12218
```
Horizontal bar plot of the most frequent words in the positive reviews:
```python
import matplotlib.pyplot as plt

# set figure size
fig, ax = plt.subplots(figsize=(12, 8))
# plot horizontal bar plot
positive_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in positive corpus")
plt.show()
```
Output: (horizontal bar chart of the 10 most frequent words in the positive corpus)
Horizontal bar plot of the most frequent words in the negative reviews:
```python
# set figure size
fig, ax = plt.subplots(figsize=(10, 8))
# plot horizontal bar plot
negative_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in negative corpus")
plt.show()
```
Output: (horizontal bar chart of the 10 most frequent words in the negative corpus)
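Since the goal is to compare the two corpora, you can also draw both charts side by side. Here is a minimal sketch, assuming the two dataframes built above are in scope:

```python
# one row of two axes, one chart per corpus
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

positive_freq_words_df.sort_values(by='Frequency').plot.barh(
    x="Word", y="Frequency", ax=axes[0], legend=False)
axes[0].set_title("Most Common words in positive corpus")

negative_freq_words_df.sort_values(by='Frequency').plot.barh(
    x="Word", y="Frequency", ax=axes[1], legend=False)
axes[1].set_title("Most Common words in negative corpus")

plt.tight_layout()
plt.show()
```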
Next Steps
The above was a good exploratory analysis to see the most frequent words used in the IMDB movie reviews dataset for positive and negative reviews. As a next step, you can go ahead and train your own sentiment analysis model to take in a movie review and predict whether it’s positive or negative.
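If you want a concrete starting point, here is a minimal sketch of such a model using scikit-learn (an assumption on our part; this tutorial does not prescribe a library). It builds bag-of-words features with CountVectorizer, which relies on exactly the kind of per-review word counts computed above, and fits a logistic regression baseline. It assumes the reviews_df dataframe with the clean_review column from this tutorial is in scope:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hold out 20% of the reviews for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    reviews_df['clean_review'], reviews_df['sentiment'],
    test_size=0.2, random_state=42)

# bag-of-words features, i.e., per-review word counts
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# a simple linear classifier as a baseline
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test_vec)))
```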
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a Python 3.8.3 kernel and pandas version 1.0.5.