Missing Values in Pandas Category Column

In this tutorial, we will look at how to handle missing values in a category type column in Pandas with the help of some examples.

The category type in Pandas is quite handy when working with categorical data. It stores each distinct value only once and represents the data with small integer codes, so a column with many repeated values typically takes less memory than the same column stored with the object type used for string values.
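
To get a feel for the memory difference, here is a minimal sketch that compares the two dtypes. The sample series is made up for illustration, and the exact byte counts will depend on your Pandas version and platform, so no specific numbers are shown.

```python
import pandas as pd

# 30,000 strings but only three distinct values (made-up sample data)
sizes = pd.Series(["Small", "Medium", "Large"] * 10_000, dtype="object")

# deep=True also counts the memory held by the Python string objects
object_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

With this many repeated values, the category version should come out much smaller. With mostly unique values, the saving shrinks or disappears.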

Can categorical data in Pandas have missing values?

Yes, categorical data (in this tutorial, a category type column) in Pandas can contain missing values. A missing value, however, is not one of the possible categories of the column. Let’s look at an example.

import numpy as np
import pandas as pd

# create a dataframe
df = pd.DataFrame({
        "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
        "Shirt Size": ["Large", "Small", "Small", "Medium", np.nan]
})
# change to category dtype
df["Shirt Size"] = df["Shirt Size"].astype("category")
# display the "Shirt Size" column
print(df["Shirt Size"])

Output:

0     Large
1     Small
2     Small
3    Medium
4       NaN
Name: Shirt Size, dtype: category
Categories (3, object): ['Large', 'Medium', 'Small']

In the above dataframe, the “Shirt Size” column is of category type. Note that it has a NaN value in the data and the possible categories for the column are “Large”, “Medium”, and “Small”.
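
If you want to confirm this yourself, the following sketch inspects the categories and the integer codes that Pandas stores behind a categorical series. It rebuilds the same shirt sizes as a standalone series (the variable name `sizes` is just for illustration).

```python
import numpy as np
import pandas as pd

# same values as the "Shirt Size" column, built directly as a category series
sizes = pd.Series(["Large", "Small", "Small", "Medium", np.nan], dtype="category")

# the possible categories do not contain NaN
print(list(sizes.cat.categories))
# the integer codes behind the values; a missing value gets the code -1
print(list(sizes.cat.codes))
```

The missing value is represented by the special code -1 rather than by a category of its own.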

NaN values can also appear in a category column if you remove an existing category value from the possible categories. For example, let’s remove the “Small” category value from the “Shirt Size” column.

# remove "Small" as a category value from "Shirt Size" column
df["Shirt Size"] = df["Shirt Size"].cat.remove_categories("Small")
# display the "Shirt Size" column
print(df["Shirt Size"])

Output:

0     Large
1       NaN
2       NaN
3    Medium
4       NaN
Name: Shirt Size, dtype: category
Categories (2, object): ['Large', 'Medium']

You can see that all the occurrences of “Small” in the original data have now been replaced by NaN.

The common Pandas functions for working with missing values, isna(), fillna(), and dropna(), work on a category type column as well. Let’s look at each of them with the help of some examples.

Check if values in a category column are missing

You can use the Pandas isna() function to check whether a value in a column is a missing value or not. Let’s apply this function to the “Shirt Size” column in the above dataframe.

# check for missing values in "Shirt Size" column
print(df["Shirt Size"].isna())

Output:

0    False
1     True
2     True
3    False
4     True
Name: Shirt Size, dtype: bool

You can see that we get True for all the instances of NaN in the data and False for the rest.
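
The boolean result is often more useful when summarized. As a small sketch, the code below counts the missing values and selects the non-missing ones. It uses a standalone series (named `sizes` here for illustration) that mirrors the column after “Small” was removed.

```python
import numpy as np
import pandas as pd

# mirrors the "Shirt Size" column after removing the "Small" category
sizes = pd.Series(["Large", np.nan, np.nan, "Medium", np.nan], dtype="category")

# True counts as 1, so summing the boolean mask gives the number of missing values
n_missing = sizes.isna().sum()
# notna() is the inverse mask; use it to keep only the non-missing values
present = sizes[sizes.notna()]

print(n_missing)
print(present.tolist())
```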

Fill missing values in a category column

You can use the Pandas fillna() function to fill missing values in a Pandas category column. Note that the value used for filling the missing values should be one of the possible category values.

For example, let’s fill the missing values in the “Shirt Size” column with the category value “Medium”.

# fill missing values with "Medium"
print(df["Shirt Size"].fillna("Medium"))

Output:

0     Large
1    Medium
2    Medium
3    Medium
4    Medium
Name: Shirt Size, dtype: category
Categories (2, object): ['Large', 'Medium']

We get a Pandas series with NaN filled by “Medium”. Note that we haven’t modified the original column here. We are only printing the value returned by the fillna() function.
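
If you do want to keep the filled values, assign the result back to the column. Here is a minimal sketch of that, done on a copy (the name `filled` is just for illustration) so that the original dataframe stays as it is.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
    "Shirt Size": pd.Categorical(["Large", np.nan, np.nan, "Medium", np.nan]),
})

# work on a copy so the original dataframe stays untouched
filled = df.copy()
# fillna() returns a new series; assign it back to keep the result
filled["Shirt Size"] = filled["Shirt Size"].fillna("Medium")

# number of missing values in the copy and in the original
print(filled["Shirt Size"].isna().sum())
print(df["Shirt Size"].isna().sum())
```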

If you try to fill the missing value in a category column with a value other than a possible category value, it will result in an error.

# fill missing values with "Xtra Large"
print(df["Shirt Size"].fillna("Xtra Large"))

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [7], in <module>
      1 # fill missing values with "Xtra Large"
----> 2 print(df["Shirt Size"].fillna("Xtra Large"))

.....
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

Here, we try to fill the missing values in the “Shirt Size” column with “Xtra Large”, a value that is not one of the possible category values, and thus we get an error. Note that the exact exception type and message can differ depending on your Pandas version.
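
As the error message suggests, one way around this is to add the new value to the categories before filling. The sketch below does this with the cat.add_categories() function on a standalone series that mirrors the column (the names `sizes` and `filled` are just for illustration).

```python
import numpy as np
import pandas as pd

# mirrors the "Shirt Size" column after removing the "Small" category
sizes = pd.Series(["Large", np.nan, np.nan, "Medium", np.nan], dtype="category")

# register the new value as a category first, then fill with it
filled = sizes.cat.add_categories("Xtra Large").fillna("Xtra Large")

print(filled.tolist())
print(list(filled.cat.categories))
```

The new category is appended to the existing ones, and the fill then works without an error.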

Drop missing values in a category column

You can use the Pandas dropna() function to drop records with missing values in a Pandas categorical column.

First, let’s print out the dataframe that we’re working with.

# display the dataframe
print(df)

Output:

    Name Shirt Size
0    Tim      Large
1  Sarah        NaN
2  Hasan        NaN
3  Jyoti     Medium
4   Jack        NaN

Let’s now drop rows with missing values. When called without arguments, dropna() drops every row that has a missing value in any column.

# drop rows with missing values
print(df.dropna())

Output:

    Name Shirt Size
0    Tim      Large
3  Jyoti     Medium

You can see that the resulting dataframe doesn’t have any missing values.
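
If other columns can also have missing values and you only care about the category column, you can pass the subset parameter to dropna(). The sketch below uses a made-up variation of the dataframe, with one missing name, to show the difference.

```python
import numpy as np
import pandas as pd

# made-up variation of the data: row 2 has a missing name but a valid shirt size
df2 = pd.DataFrame({
    "Name": ["Tim", "Sarah", None, "Jyoti", "Jack"],
    "Shirt Size": pd.Categorical(["Large", np.nan, "Small", "Medium", np.nan]),
})

# drops rows with a missing value in any column
any_missing_dropped = df2.dropna()
# drops rows only if "Shirt Size" is missing
size_missing_dropped = df2.dropna(subset=["Shirt Size"])

print(any_missing_dropped.index.tolist())
print(size_missing_dropped.index.tolist())
```

The row with the missing name survives in the second result because only the “Shirt Size” column is checked.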


Author

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.
