Skip to Content

Pandas – Drop Duplicate Columns From Dataframe

In this tutorial, we will look at how to drop (remove) duplicate columns from a pandas dataframe with the help of some examples.

How to remove duplicate columns from a pandas dataframe?

To correctly drop a duplicate column, you first need to define what it’s meant by a duplicate column. Do you consider columns having the same name (irrespective of the values) as duplicates? Or, do you decide whether columns are duplicates based on their values only?

Now, depending on how you choose to define duplicate columns there can be different ways to remove them from a pandas dataframe.

  • To remove duplicate columns based on the column names, first, identify the duplicate columns and then remove them using the .loc property.
  • To remove duplicate columns based on the column values, transpose the dataframe, drop duplicate rows, and then transpose it back (see the examples below).

Examples

Let’s now look at some examples of using the above methods to drop duplicate columns from a dataframe.

Example 1 – Drop duplicate columns based on column names

In this method, we say two (or more columns) are duplicates if they have the same name irrespective of their values.

Let’s look at an example.

import pandas as pd

# create pandas dataframe
df = pd.DataFrame(list(zip(
        ["Jim", "Dwight", "Angela", "Tobi"],
        [26, 28, 27, 32],
        ["Sales", "Sales", "Accounting", "HR"],
        ["Sales", "Sales", "Accounting", "HR"],
        ["HR", "Sales", "Sales", "Sales"]
)), columns=["Name", "Age", "Department", "Department", "Department"])

# display the dataframe
df

Output:

pandas dataframe with three "Department" columns

Here, we created a dataframe with data of some employees in the offices. You can see that the above dataframe has three “Department” columns. The first two “Department” columns have the same values whereas the third “Department” column has different values.

Now, if we are to drop duplicate columns based on their names, we first need to identify them. You can use the df.columns.duplicated() method to get a boolean array representing whether columns are duplicates (are already present) or not.

First, let’s see the output of df.columns.duplicated() for the above dataframe.

df.columns.duplicated()

Output:

array([False, False, False,  True,  True])

You can see that we get False for the first three values which represent the non-duplicate columns – “Name”, “Age”, and the first “Department” column in the dataframe. We get True for the last two columns which are the duplicate “Department” columns (as a column with this name is already present).

Now, if you want to remove the duplicate columns (and keep only the first occurrence of the column), utilize the above result as a boolean index with the .loc property of the dataframe.

The following is the syntax –

# remove duplicate columns (based on column names)
df = df.loc[:, ~df.columns.duplicated()]

Let’s now use this syntax on the above dataframe.

# remove duplicate columns (based on column names)
df = df.loc[:, ~df.columns.duplicated()]
# display the dataframe
df

Output:

employee dataframe with duplicate columns (based on column names) removed

The resulting dataframe does not have the duplicate “Department” columns. Remember that here, we decided whether the columns are duplicates or not based on just their names and not their values.

Example 2 – Drop duplicate columns based on column values

In this method, we say two (or more) columns are duplicates if they have the same values (irrespective of the column names). This is the more likely use case when removing duplicate columns.

To remove columns having the same values, use the following steps –

  1. Transpose the dataframe (this will change rows to columns and columns to rows).
  2. Remove duplicate rows using drop_duplicates().
  3. Transpose the dataframe back (this will bring back our row and column configuration).

The idea is to remove the duplicate columns as duplicate rows of the transposed dataframe.

The following is the syntax –

# remove duplicate columns (based on column values)
df = df.T.drop_duplicates().T

Let’s look at an example, we will use the same dataframe from above.

import pandas as pd

# create pandas dataframe
df = pd.DataFrame(list(zip(
        ["Jim", "Dwight", "Angela", "Tobi"],
        [26, 28, 27, 32],
        ["Sales", "Sales", "Accounting", "HR"],
        ["Sales", "Sales", "Accounting", "HR"],
        ["HR", "Sales", "Sales", "Sales"]
)), columns=["Name", "Age", "Department", "Department", "Department"])

# display the dataframe
df

Output:

pandas dataframe with three "Department" columns

We have a dataframe with three “Department” columns. The first two “Department” columns have the same values whereas the third “Department” column has different values.

Let’s remove the duplicate columns (columns having the same values) in the above dataframe.

# remove duplicate columns (based on column values)
df = df.T.drop_duplicates().T
# display the dataframe
df

Output:

employee dataframe with duplicate columns (based on column values) removed

The resulting dataframe does not have any duplicate columns (based on column values).

Summary

In this tutorial, we looked at how to drop duplicate columns from a pandas dataframe. The following are the key takeaways.

  • If duplicate columns are defined using their column names, use df.columns.duplicated() to identify them and then remove them using .loc.
  • If duplicate columns are defined using their column values, remove the duplicate columns as rows in the transposed dataframe.

You might also be interested in –


Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.


Author

  • Piyush

    Piyush is a data scientist passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.