In this tutorial, we will look at how to drop (remove) duplicate columns from a pandas dataframe with the help of some examples.
How to remove duplicate columns from a pandas dataframe?
To correctly drop a duplicate column, you first need to define what it’s meant by a duplicate column. Do you consider columns having the same name (irrespective of the values) as duplicates? Or, do you decide whether columns are duplicates based on their values only?
Now, depending on how you choose to define duplicate columns there can be different ways to remove them from a pandas dataframe.
- To remove duplicate columns based on the column names, first, identify the duplicate columns and then remove them using the
.loc
property. - To remove duplicate columns based on the column values, transpose the dataframe, drop duplicate rows, and then transpose it back (see the examples below).
Examples
Let’s now look at some examples of using the above methods to drop duplicate columns from a dataframe.
Example 1 – Drop duplicate columns based on column names
In this method, we say two (or more columns) are duplicates if they have the same name irrespective of their values.
Let’s look at an example.
import pandas as pd # create pandas dataframe df = pd.DataFrame(list(zip( ["Jim", "Dwight", "Angela", "Tobi"], [26, 28, 27, 32], ["Sales", "Sales", "Accounting", "HR"], ["Sales", "Sales", "Accounting", "HR"], ["HR", "Sales", "Sales", "Sales"] )), columns=["Name", "Age", "Department", "Department", "Department"]) # display the dataframe df
Output:
Here, we created a dataframe with data of some employees in the offices. You can see that the above dataframe has three “Department” columns. The first two “Department” columns have the same values whereas the third “Department” column has different values.
Introductory ⭐
- Harvard University Data Science: Learn R Basics for Data Science
- Standford University Data Science: Introduction to Machine Learning
- UC Davis Data Science: Learn SQL Basics for Data Science
- IBM Data Science: Professional Certificate in Data Science
- IBM Data Analysis: Professional Certificate in Data Analytics
- Google Data Analysis: Professional Certificate in Data Analytics
- IBM Data Science: Professional Certificate in Python Data Science
- IBM Data Engineering Fundamentals: Python Basics for Data Science
Intermediate ⭐⭐⭐
- Harvard University Learning Python for Data Science: Introduction to Data Science with Python
- Harvard University Computer Science Courses: Using Python for Research
- IBM Python Data Science: Visualizing Data with Python
- DeepLearning.AI Data Science and Machine Learning: Deep Learning Specialization
Advanced ⭐⭐⭐⭐⭐
- UC San Diego Data Science: Python for Data Science
- UC San Diego Data Science: Probability and Statistics in Data Science using Python
- Google Data Analysis: Professional Certificate in Advanced Data Analytics
- MIT Statistics and Data Science: Machine Learning with Python - from Linear Models to Deep Learning
- MIT Statistics and Data Science: MicroMasters® Program in Statistics and Data Science
🔎 Find Data Science Programs 👨💻 111,889 already enrolled
Disclaimer: Data Science Parichay is reader supported. When you purchase a course through a link on this site, we may earn a small commission at no additional cost to you. Earned commissions help support this website and its team of writers.
Now, if we are to drop duplicate columns based on their names, we first need to identify them. You can use the df.columns.duplicated()
method to get a boolean array representing whether columns are duplicates (are already present) or not.
First, let’s see the output of df.columns.duplicated()
for the above dataframe.
df.columns.duplicated()
Output:
array([False, False, False, True, True])
You can see that we get False
for the first three values which represent the non-duplicate columns – “Name”, “Age”, and the first “Department” column in the dataframe. We get True
for the last two columns which are the duplicate “Department” columns (as a column with this name is already present).
Now, if you want to remove the duplicate columns (and keep only the first occurrence of the column), utilize the above result as a boolean index with the .loc
property of the dataframe.
The following is the syntax –
# remove duplicate columns (based on column names) df = df.loc[:, ~df.columns.duplicated()]
Let’s now use this syntax on the above dataframe.
# remove duplicate columns (based on column names) df = df.loc[:, ~df.columns.duplicated()] # display the dataframe df
Output:
The resulting dataframe does not have the duplicate “Department” columns. Remember that here, we decided whether the columns are duplicates or not based on just their names and not their values.
Example 2 – Drop duplicate columns based on column values
In this method, we say two (or more) columns are duplicates if they have the same values (irrespective of the column names). This is the more likely use case when removing duplicate columns.
To remove columns having the same values, use the following steps –
- Transpose the dataframe (this will change rows to columns and columns to rows).
- Remove duplicate rows using
drop_duplicates()
. - Transpose the dataframe back (this will bring back our row and column configuration).
The idea is to remove the duplicate columns as duplicate rows of the transposed dataframe.
The following is the syntax –
# remove duplicate columns (based on column values) df = df.T.drop_duplicates().T
Let’s look at an example, we will use the same dataframe from above.
import pandas as pd # create pandas dataframe df = pd.DataFrame(list(zip( ["Jim", "Dwight", "Angela", "Tobi"], [26, 28, 27, 32], ["Sales", "Sales", "Accounting", "HR"], ["Sales", "Sales", "Accounting", "HR"], ["HR", "Sales", "Sales", "Sales"] )), columns=["Name", "Age", "Department", "Department", "Department"]) # display the dataframe df
Output:
We have a dataframe with three “Department” columns. The first two “Department” columns have the same values whereas the third “Department” column has different values.
Let’s remove the duplicate columns (columns having the same values) in the above dataframe.
# remove duplicate columns (based on column values) df = df.T.drop_duplicates().T # display the dataframe df
Output:
The resulting dataframe does not have any duplicate columns (based on column values).
Summary
In this tutorial, we looked at how to drop duplicate columns from a pandas dataframe. The following are the key takeaways.
- If duplicate columns are defined using their column names, use
df.columns.duplicated()
to identify them and then remove them using.loc
. - If duplicate columns are defined using their column values, remove the duplicate columns as rows in the transposed dataframe.
You might also be interested in –
- Pandas – Drop one or more Columns from a Dataframe
- Pandas – Delete All Columns Except Some Columns
- Drop Duplicates from a Pandas DataFrame
- Drop Rows with NaNs in Pandas DataFrame
- Pandas – Drop first n rows of a DataFrame
- Pandas – Drop last n rows of a DataFrame
Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.