
Join Category Columns with Pandas union_categoricals()

Pandas is a popular Python library for data manipulation. It provides a dedicated category dtype for categorical data, along with helper functions for working with it. In this tutorial, we will look at how to join two category type series using the Pandas union_categoricals() function.

Pandas concat() vs union_categoricals()

You can use either the Pandas concat() function or the union_categoricals() function to combine category type data. For example, both functions give similar outcomes when combining categorical data that share the same categories.

import pandas as pd
from pandas.api.types import union_categoricals

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'a', 'a']).astype('category')

# combine the series with concat()
print(pd.concat([s1, s2]))
print("-------")
# combine the series with union_categoricals()
print(union_categoricals([s1, s2]))

Output:

0    a
1    b
0    b
1    a
2    a
dtype: category
Categories (2, object): ['a', 'b']
-------
['a', 'b', 'b', 'a', 'a']
Categories (2, object): ['a', 'b']

Here, we combine two category type Pandas series, s1 and s2, which share the same categories, “a” and “b”.

Note that concat() returns a Pandas series (which keeps the original index labels of s1 and s2), whereas union_categoricals() returns a Pandas Categorical. Both results are of category type and have the same categories.
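If you need a series rather than a Categorical, one option is to wrap the result in pd.Series(). The snippet below is a minimal sketch of that idea; the variable names combined and combined_series are only illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'a', 'a']).astype('category')

# union_categoricals() returns a Categorical, not a Series
combined = union_categoricals([s1, s2])
print(type(combined).__name__)

# wrapping it in a Series keeps the category dtype and creates a fresh 0..n-1 index
combined_series = pd.Series(combined)
print(combined_series.dtype)
print(combined_series.index.tolist())
```

Unlike the concat() result, the wrapped series does not carry over the index labels of the inputs.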

However, if you use the Pandas concat() function to join categorical series with different categories, the resulting series is no longer of category type.

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'd', 'c']).astype('category')

# combine the series with concat()
print(pd.concat([s1, s2]))

Output:

0    a
1    b
0    b
1    d
2    c
dtype: object
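Rather than reading the dtype off the printed output, you can check it in code. The following sketch compares a concat() of matching categories with a concat() of differing ones; the names same and mixed are illustrative only.

```python
import pandas as pd

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'd', 'c']).astype('category')

# identical categories on both sides: the category dtype is kept
same = pd.concat([s1, s1])
print(isinstance(same.dtype, pd.CategoricalDtype))

# different categories: the result is no longer categorical
mixed = pd.concat([s1, s2])
print(isinstance(mixed.dtype, pd.CategoricalDtype))
```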

Let’s now join the two series with different category values using the Pandas union_categoricals() function.

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'd', 'c']).astype('category')

# combine the series with union_categoricals()
print(union_categoricals([s1, s2]))

Output:

['a', 'b', 'b', 'd', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']

You can see that the outcome is categorical with categories from both the joined series.

Thus, a key advantage of using union_categoricals() is that it can join categorical data with different categories and still return a category type result.
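If you want to keep using concat() (for instance, to get a series back), one possible approach is to first give both series the combined set of categories and then concatenate them. This is a sketch of one way to do it, not the only one; the names combined_cats, s1_aligned, s2_aligned and result are illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'd', 'c']).astype('category')

# take the combined categories from union_categoricals()
combined_cats = union_categoricals([s1, s2]).categories

# give both series the same categories, then concatenate
s1_aligned = s1.cat.set_categories(combined_cats)
s2_aligned = s2.cat.set_categories(combined_cats)
result = pd.concat([s1_aligned, s2_aligned], ignore_index=True)

print(result.dtype)
print(list(result.cat.categories))
```

Because both inputs now share exactly the same categorical dtype, concat() has no reason to fall back to a non-categorical result.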

Pandas union_categoricals() on ordered categorical data

You can also use union_categoricals() on ordered categorical data.

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s1 = s1.cat.set_categories(['a', 'b'], ordered=True)
s2 = pd.Series(['b', 'a', 'b']).astype('category')
s2 = s2.cat.set_categories(['a', 'b'], ordered=True)

# combine the series with union_categoricals()
print(union_categoricals([s1, s2]))

Output:

['a', 'b', 'b', 'a', 'b']
Categories (2, object): ['a' < 'b']

Here, we combine two series having the same categories and the same category order. You can see that the result is also an ordered categorical having the same category order.
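To confirm the ordering in code, you can inspect the ordered attribute of the result. The sketch below builds the same data with a shared pd.CategoricalDtype (an alternative to the set_categories() calls used above, chosen here for brevity) and also shows a comparison that only ordered categoricals support.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# build two ordered categorical series with identical categories and order
order = pd.CategoricalDtype(['a', 'b'], ordered=True)
s1 = pd.Series(['a', 'b'], dtype=order)
s2 = pd.Series(['b', 'a', 'b'], dtype=order)

combined = union_categoricals([s1, s2])

# the ordered flag survives the union
print(combined.ordered)

# ordered categoricals can be compared against one of their categories
print((combined > 'a').tolist())
```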

Let’s see what happens if you combine two ordered series whose categories differ.

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s1 = s1.cat.set_categories(['a', 'b'], ordered=True)
s2 = pd.Series(['b', 'a', 'c']).astype('category')
s2 = s2.cat.set_categories(['a', 'b', 'c'], ordered=True)

# combine the series with union_categoricals()
print(union_categoricals([s1, s2]))

Output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [17], in <module>
      5 s2 = s2.cat.set_categories(['a', 'b', 'c'], ordered=True)
      7 # combine the series with union_categoricals()
----> 8 print(union_categoricals([s1, s2]))
...
TypeError: to union ordered Categoricals, all categories must be the same

We get an error because union_categoricals() requires ordered categoricals to have the same categories in the same order.
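In a script, you may prefer to handle this failure rather than let it stop the program. Here is a minimal sketch that catches the TypeError; the error_message variable is only there for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# two ordered categorical series with different categories
s1 = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['a', 'b'], ordered=True))
s2 = pd.Series(['b', 'a', 'c'], dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))

error_message = None
try:
    combined = union_categoricals([s1, s2])
except TypeError as err:
    # raised because the ordered categoricals do not share the same categories
    error_message = str(err)
    print(f"Could not combine: {error_message}")
```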

To join ordered categoricals with different categories or orders, pass the ignore_order=True argument.

# create category type pandas series
s1 = pd.Series(['a', 'b']).astype('category')
s1 = s1.cat.set_categories(['a', 'b'], ordered=True)
s2 = pd.Series(['b', 'a', 'c']).astype('category')
s2 = s2.cat.set_categories(['a', 'b', 'c'], ordered=True)

# combine the series with union_categoricals()
print(union_categoricals([s1, s2], ignore_order=True))

Output:

['a', 'b', 'b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']

We didn’t get an error this time. Notice that the resulting categorical is unordered.
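If you still need an ordered result, you can set the order again on the combined Categorical. The following sketch assumes the order you want is 'a' < 'b' < 'c'; the name reordered is illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# two ordered categorical series with different categories
s1 = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['a', 'b'], ordered=True))
s2 = pd.Series(['b', 'a', 'c'], dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))

# the union succeeds, but the ordering is dropped
combined = union_categoricals([s1, s2], ignore_order=True)
print(combined.ordered)

# explicitly restore an order on the combined categories
reordered = combined.set_categories(['a', 'b', 'c'], ordered=True)
print(reordered.ordered)
print(list(reordered.categories))
```

Stating the order explicitly, instead of relying on whatever order the union produced, makes the intended ranking visible in the code.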

Author

  • Piyush

    Piyush is a data scientist passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.