In this tutorial, we will look at the purpose and the usage of the pandas get_dummies() function with the help of some examples.
What does the pandas get_dummies() function do?
The pandas get_dummies() function converts a categorical variable into indicator (dummy) variables, with one column per category. It returns the dummy-coded data as a pandas dataframe.
Let’s apply this function to a list containing t-shirt sizes of 5 students in a class.
import pandas as pd

# list with t-shirt sizes
ls = ['M', 'L', 'S', 'XL', 'M']

# get dummies
pd.get_dummies(ls)
Output:
You can see that we get the dummy data for the above list as a dataframe. Note that we have one column for each unique value in the list and each row represents a list item with the respective t-shirt size.
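The shape described above can also be checked in code. The following is a minimal sketch, not part of the original example: it passes dtype=int (an optional get_dummies() parameter) so the indicators are 0/1 integers regardless of the pandas version, since the default dtype differs between releases. The variable name dummies is chosen here for illustration.

```python
import pandas as pd

# list with t-shirt sizes (same data as above)
ls = ['M', 'L', 'S', 'XL', 'M']

# dtype=int requests 0/1 integer indicators instead of the version-dependent default
dummies = pd.get_dummies(ls, dtype=int)

# one column per unique value, in sorted order
print(list(dummies.columns))  # ['L', 'M', 'S', 'XL']

# one row per list item
print(dummies.shape)  # (5, 4)

# each row has exactly one indicator set
print((dummies.sum(axis=1) == 1).all())  # True
```

Each row sums to 1 because every list item belongs to exactly one size category.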
Encode Categorical Columns in Pandas Dataframe
Generally, the get_dummies() function is applied to categorical columns in a pandas dataframe to generate dummy (one-hot encoded) columns. This is an important step in data science and ML pipelines, since many models require their input data in numeric form.
Let’s look at some examples of using the pandas get_dummies() function to encode categorical columns.
Get Dummies for a Single Column
Here we pass a single dataframe column to the get_dummies() function. Let’s look at an example. First, we will create a sample dataframe.
import pandas as pd

# create dataframe
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "year": ["Senior", "Senior", "Junior", "Sophomore", "Freshman", "Freshman"],
    "shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']
})

# display the dataframe
df
Output:
We have a dataframe containing the student_id, year, and the t-shirt sizes of some students in a university. Let’s one-hot encode the “shirt_size” column.
# one-hot encode the "shirt_size" column
pd.get_dummies(df["shirt_size"])
Output:
It returns a dataframe resulting from encoding the “shirt_size” column. Note that each unique size has a separate column.
You can also specify a prefix to use for all the dummy columns. Pass your desired prefix as an argument to the prefix parameter of the get_dummies() function.
# one-hot encode the "shirt_size" column, prefixing each dummy column
pd.get_dummies(df["shirt_size"], prefix="shirt_size")
Output:
Here we use the column name, “shirt_size”, as the prefix for each dummy column. The prefix and the category value are joined with an underscore by default.
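To make the effect of the prefix concrete, here is a small sketch that prints the resulting column names. It also uses the prefix_sep parameter of get_dummies(), which the examples above do not cover, to change the separator. The names default_sep and custom_sep, the prefix "size", and the ":" separator are illustrative choices.

```python
import pandas as pd

# same shirt sizes as in the sample dataframe
df = pd.DataFrame({"shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']})

# default separator between prefix and category is "_"
default_sep = pd.get_dummies(df["shirt_size"], prefix="shirt_size")
print(list(default_sep.columns))  # ['shirt_size_L', 'shirt_size_M', 'shirt_size_S']

# prefix_sep changes the separator; "size" and ":" are arbitrary choices here
custom_sep = pd.get_dummies(df["shirt_size"], prefix="size", prefix_sep=":")
print(list(custom_sep.columns))  # ['size:L', 'size:M', 'size:S']
```

A prefix is mainly useful when the encoded columns are later combined with other columns, so that it stays clear which original column each dummy came from.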
Get Dummies for Multiple Columns
You can also pass a dataframe with multiple columns to the get_dummies() function. It returns a dataframe in which all the categorical columns have been dummy-coded.
Let’s look at an example. This time, let’s pass the entire dataframe df used in the above example to the get_dummies() function.
# one-hot encode all the categorical columns
pd.get_dummies(df)
Output:
We get one-hot encoded data for both categorical columns, “year” and “shirt_size”, as a dataframe. Note that the numerical column “student_id” remained unchanged. Also, note that we didn’t have to specify a prefix here: when a dataframe is passed, the function uses each column’s name as the prefix for its dummy columns by default.
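The resulting column layout can be inspected directly. The sketch below also shows the columns parameter of get_dummies(), which is not used in the examples above and restricts encoding to the listed columns. The variable names all_encoded and size_only are illustrative.

```python
import pandas as pd

# same sample dataframe as above
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "year": ["Senior", "Senior", "Junior", "Sophomore", "Freshman", "Freshman"],
    "shirt_size": ['M', 'L', 'S', 'S', 'M', 'M']
})

# encode every categorical (string) column; numeric columns pass through
all_encoded = pd.get_dummies(df)
print(list(all_encoded.columns))
# ['student_id', 'year_Freshman', 'year_Junior', 'year_Senior',
#  'year_Sophomore', 'shirt_size_L', 'shirt_size_M', 'shirt_size_S']

# encode only "shirt_size"; "year" is left as a regular column
size_only = pd.get_dummies(df, columns=["shirt_size"])
print(list(size_only.columns))
# ['student_id', 'year', 'shirt_size_L', 'shirt_size_M', 'shirt_size_S']
```

Restricting the encoding with columns can be handy when some text columns, such as free-form notes or identifiers, should not be expanded into dummies.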