In this tutorial, we will look at how to handle missing values in a category type column in Pandas with the help of some examples.
The category type in Pandas is quite handy when working with categorical data. When a column contains a large number of repeated values, it takes considerably less memory than the object type used to store string values.
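To get a rough feel for the memory difference, here is a minimal sketch. The data is hypothetical (three distinct strings repeated many times), and the exact byte counts will vary with your Pandas version and platform.

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated 10,000 times each,
# stored explicitly with the object dtype
sizes = pd.Series(["Small", "Medium", "Large"] * 10_000, dtype="object")

# deep=True also counts the memory used by the Python string objects
object_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

With only a handful of distinct values, the category version should come out much smaller, since each value is stored as a small integer code rather than a full string.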
Can categorical data in Pandas have missing values?
Yes, categorical data (or, in this tutorial, a category type column) in Pandas can have missing values, but missing values are not included in the possible values for the categories. Let’s look at an example.
    import numpy as np
    import pandas as pd

    # create a dataframe
    df = pd.DataFrame({
        "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
        "Shirt Size": ["Large", "Small", "Small", "Medium", np.nan]
    })

    # change to category dtype
    df["Shirt Size"] = df["Shirt Size"].astype("category")

    # display the "Shirt Size" column
    print(df["Shirt Size"])
Output:
    0     Large
    1     Small
    2     Small
    3    Medium
    4       NaN
    Name: Shirt Size, dtype: category
    Categories (3, object): ['Large', 'Medium', 'Small']
In the above dataframe, the “Shirt Size” column is of category type. Note that it has a NaN value in the data, while the possible categories for the column are only “Large”, “Medium”, and “Small”.
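One way to confirm that NaN is not one of the categories is to inspect the column through the .cat accessor. The sketch below rebuilds the same shirt sizes as a standalone series (the variable name sizes is just for illustration) and prints the categories along with the integer codes Pandas stores internally.

```python
import numpy as np
import pandas as pd

# Same shirt sizes as in the example above, as a standalone series
sizes = pd.Series(["Large", "Small", "Small", "Medium", np.nan], dtype="category")

# NaN is not listed among the categories
print(sizes.cat.categories.tolist())

# each value is stored as an integer code; missing values get the code -1
print(sizes.cat.codes.tolist())
```

The missing value shows up with the code -1, which does not correspond to any category.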
NaN values can also appear in a category column if you remove an existing category value from the possible categories. For example, let’s remove the “Small” category value from the “Shirt Size” column.
    # remove "Small" as a category value from "Shirt Size" column
    df["Shirt Size"] = df["Shirt Size"].cat.remove_categories("Small")

    # display the "Shirt Size" column
    print(df["Shirt Size"])
Output:
    0     Large
    1       NaN
    2       NaN
    3    Medium
    4       NaN
    Name: Shirt Size, dtype: category
    Categories (2, object): ['Large', 'Medium']
You can see that all the occurrences of “Small” in the original data have now been replaced by NaN.
You can apply the common Pandas functions for working with missing values, such as isna(), fillna(), and dropna(), to a category type column as well. Let’s look at them with the help of some examples.
Check if values in a category column are missing
You can use the Pandas isna() function to check whether a value in a column is a missing value or not. Let’s apply this function to the “Shirt Size” column in the above dataframe.
    # check for missing values in "Shirt Size" column
    print(df["Shirt Size"].isna())
Output:
    0    False
    1     True
    2     True
    3    False
    4     True
    Name: Shirt Size, dtype: bool
You can see that we get True for all the instances of NaN in the data.
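If you only need the number of missing values rather than a row-by-row mask, you can sum the boolean result. The sketch below uses a standalone series that mirrors the “Shirt Size” column at this point in the tutorial; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# Mirrors the "Shirt Size" column after "Small" was removed as a category
sizes = pd.Series(["Large", np.nan, np.nan, "Medium", np.nan], dtype="category")

# True counts as 1, so the sum is the number of missing values
missing_count = sizes.isna().sum()
print(missing_count)

# value_counts() skips NaN by default; dropna=False includes it in the tally
print(sizes.value_counts(dropna=False))
```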
Fill missing values in a category column
You can use the Pandas fillna() function to fill missing values in a Pandas category column. Note that the value used for filling the missing values should be one of the possible category values.
For example, let’s fill the missing values in the “Shirt Size” column with the category value “Medium”.
    # fill missing values with "Medium"
    print(df["Shirt Size"].fillna("Medium"))
Output:
    0     Large
    1    Medium
    2    Medium
    3    Medium
    4    Medium
    Name: Shirt Size, dtype: category
    Categories (2, object): ['Large', 'Medium']
We get a Pandas series with the NaN values filled by “Medium”. Note that we haven’t modified the original column values here; we are only printing the return value of the fillna() function.
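If you do want to keep the filled values, one option is to assign the result back to the column. The following is a minimal sketch that rebuilds the dataframe in its current state so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframe as it stands at this point in the tutorial
df = pd.DataFrame({
    "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
    "Shirt Size": pd.Categorical(["Large", np.nan, np.nan, "Medium", np.nan]),
})

# fillna() returns a new series, so assign it back to keep the change
df["Shirt Size"] = df["Shirt Size"].fillna("Medium")

# the column keeps its category dtype and no longer has missing values
print(df["Shirt Size"].dtype)
print(df["Shirt Size"].isna().sum())
```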
If you try to fill the missing value in a category column with a value other than a possible category value, it will result in an error.
    # fill missing values with "Xtra Large"
    print(df["Shirt Size"].fillna("Xtra Large"))
Output:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Input In [7], in <module>
          1 # fill missing values with "Xtra Large"
    ----> 2 print(df["Shirt Size"].fillna("Xtra Large"))
    .....
    ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Here, we try to fill the missing values in the “Shirt Size” column with “Xtra Large”, a value that is not a part of the possible category values and thus, we get an error.
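As the error message suggests, the new value has to be set as a category first. One way to do that is with cat.add_categories() before calling fillna(). This is a minimal sketch on a standalone series that mirrors the “Shirt Size” column; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# Mirrors the "Shirt Size" column, with categories "Large" and "Medium"
sizes = pd.Series(["Large", np.nan, np.nan, "Medium", np.nan], dtype="category")

# register "Xtra Large" as a possible category, then fill with it
filled = sizes.cat.add_categories("Xtra Large").fillna("Xtra Large")

print(filled)
```

Both calls return new series, so the original sizes series is left unchanged.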
Drop missing values in a category column
You can use the Pandas dropna() function to drop records with missing values in a Pandas categorical column.
First, let’s print out the dataframe that we’re working with.
    # display the dataframe
    print(df)
Output:
        Name Shirt Size
    0    Tim      Large
    1  Sarah        NaN
    2  Hasan        NaN
    3  Jyoti     Medium
    4   Jack        NaN
Let’s now drop rows with missing values.
    # drop rows with missing values
    print(df.dropna())
Output:
        Name Shirt Size
    0    Tim      Large
    3  Jyoti     Medium
You can see that the resulting dataframe doesn’t have any missing values.
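Note that dropna() without arguments drops a row if any of its columns has a missing value. If you only want to consider the category column, you can pass the subset parameter. The sketch below adds a hypothetical “Age” column (not part of the original dataframe) to show the difference.

```python
import numpy as np
import pandas as pd

# "Age" is a hypothetical extra column, added only for illustration
df = pd.DataFrame({
    "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
    "Age": [31, 27, 45, np.nan, 38],
    "Shirt Size": pd.Categorical(["Large", np.nan, np.nan, "Medium", np.nan]),
})

# drops every row with a missing value in any column (Jyoti is dropped too)
print(df.dropna())

# drops only the rows where "Shirt Size" is missing
cleaned = df.dropna(subset=["Shirt Size"])
print(cleaned)
```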