
Pandas Get Dummies Function – get_dummies()

In this tutorial, we will look at the purpose and the usage of the pandas get_dummies() function with the help of some examples.

What does the pandas get_dummies() function do?

The pandas get_dummies() function is used to convert a categorical variable to indicator/dummy variables (columns). It returns the dummy coded data as a pandas dataframe.


Let’s apply this function to a list containing t-shirt sizes of 5 students in a class.

import pandas as pd

# list with t-shirt sizes
ls = ['M', 'L', 'S', 'XL', 'M']
# get dummies
pd.get_dummies(ls)

Output:

Dummy data for t-shirt sizes list as dataframe.

You can see that we get the dummy data for the above list as a dataframe. Note that we have one column for each unique value in the list and each row represents a list item with the respective t-shirt size.
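To make the shape of that result concrete, the sketch below reruns the list example and inspects the column labels and dimensions. The `dtype=int` argument is an addition not used in the example above. It requests 0/1 integers, because the default dtype of the dummy columns depends on the pandas version (boolean from pandas 2.0, unsigned 8-bit integers before that).

```python
import pandas as pd

# same t-shirt sizes as in the example above
ls = ['M', 'L', 'S', 'XL', 'M']

# dtype=int is an assumption made here so the values print as 0/1
dummies = pd.get_dummies(ls, dtype=int)

# one column per unique size, in sorted order
print(list(dummies.columns))

# one row per list item, one column per unique value
print(dummies.shape)

# each row has exactly one flag set
print(dummies.sum(axis=1).tolist())
```

Each row sums to 1 because every list item belongs to exactly one size.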

Encode Categorical Columns in Pandas Dataframe

Generally, the get_dummies() function is applied to categorical columns in a pandas dataframe to generate dummy (one-hot encoded) columns. This is an important step in data science and ML pipelines that require data in numeric form.

Let’s look at some examples of using the pandas get_dummies() function to encode categorical columns.

Get Dummies for a single column

Here we pass a single dataframe column to the get_dummies() function. Let’s look at an example. First, we will create a sample dataframe.

import pandas as pd

# create dataframe 
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "year": ["Senior", "Senior", "Junior", "Sophomore", "Freshman", "Freshman"],
    "shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']
})
# display the dataframe
df

Output:

university students dataframe

We have a dataframe containing the student_id, year, and the t-shirt sizes of some students in a university. Let’s one-hot encode the “shirt_size” column.

# one-hot encode the "shirt_size" column
pd.get_dummies(df["shirt_size"])

Output:

result from pandas get_dummies on single column

It returns a dataframe resulting from encoding the “shirt_size” column. Note that each unique size has a separate column.
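As a quick sanity check on the one-hot property, the sketch below confirms that each row of the encoded result flags exactly one size and that the original labels can be read back from it. The `dtype=int` argument and the `recovered` variable are illustrative additions, not part of the example above.

```python
import pandas as pd

# trimmed copy of the sample dataframe
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']
})

# dtype=int is an assumption made here so the flags are 0/1 integers
dummies = pd.get_dummies(df["shirt_size"], dtype=int)

# every row should contain exactly one 1
print(dummies.sum(axis=1).tolist())

# the column holding the 1 in each row is the original size
recovered = dummies.idxmax(axis=1)
print(recovered.tolist() == df["shirt_size"].tolist())
```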

You can also specify a prefix to use for all the dummy columns. Pass your desired prefix as an argument to the prefix parameter of the get_dummies() function.

# one-hot encode the "shirt_size" column and prefix each dummy column name
pd.get_dummies(df["shirt_size"], prefix="shirt_size")

Output:

dummies from shirt_size column with prefix

Here we use the column name, “shirt_size” as the prefix for each dummy column.
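The prefix is joined to each category with an underscore by default. As a small sketch, the separator can be changed with the `prefix_sep` parameter of get_dummies(). The series `sizes` and the "=" separator below are illustrative choices only.

```python
import pandas as pd

# the shirt sizes from the sample dataframe, as a standalone series
sizes = pd.Series(['M', 'L', 'S', 'S', 'M', 'M'], name="shirt_size")

# default separator between prefix and category is "_"
default_sep = pd.get_dummies(sizes, prefix="shirt_size")
print(list(default_sep.columns))

# a custom separator, chosen here purely for illustration
custom_sep = pd.get_dummies(sizes, prefix="shirt_size", prefix_sep="=")
print(list(custom_sep.columns))
```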

Get Dummies for Multiple Columns

You can also pass a dataframe with multiple columns to the get_dummies() function. By default, it encodes every column with an object, string, or category dtype and returns the result as a dataframe, leaving the other columns as they are.

Let’s look at an example. This time, let’s pass the entire dataframe df used in the above example to the get_dummies() function.

# one-hot encode all the categorical columns
pd.get_dummies(df)

Output:

result of pandas get_dummies function on entire dataframe.

We get one-hot encoded data for both categorical columns, “year” and “shirt_size”, as a dataframe. Note that the numerical column “student_id” remained unchanged. Also, note that we didn’t have to specify a prefix here: when a dataframe is passed, the function uses each encoded column’s name as the prefix by default.
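If only some of the categorical columns should be encoded, the `columns` parameter of get_dummies() can restrict the encoding to the names you list. The following is a minimal sketch on the same sample data. Choosing "shirt_size" alone is just an illustration.

```python
import pandas as pd

# the sample dataframe from above
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "year": ["Senior", "Senior", "Junior", "Sophomore", "Freshman", "Freshman"],
    "shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']
})

# encode every categorical column (the default behaviour)
all_encoded = pd.get_dummies(df)
print(list(all_encoded.columns))

# encode only "shirt_size"; "year" is left as it is
partly_encoded = pd.get_dummies(df, columns=["shirt_size"])
print(list(partly_encoded.columns))
```

In both results the untouched columns come first, followed by the dummy columns.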



Author

  • Piyush is a data scientist passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.