Compare Two DataFrames for Equality in Pandas

When working with pandas dataframes, you may need to check whether two dataframes are the same. In this tutorial, we’ll look at how to compare two pandas dataframes for equality, along with some examples.

The pandas dataframe method equals() is used to compare two dataframes for equality. It returns True if the two dataframes have the same shape and the same elements. For two dataframes to be equal, elements in the same location must also have the same dtype. The row and column labels must match in value, but they do not need to have the same dtype. The following is the syntax:

df1.equals(df2)

Here, df1 and df2 are the two dataframes you want to compare. Note that NaNs in the same location are considered equal.

Let’s look at some examples of how the equals() function works and what to expect when using it to compare two dataframes.

import pandas as pd

# two identical dataframes
df1 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
   A  B
0  1  x
1  2  y
True

In the above example, two dataframes df1 and df2 are compared for equality using the equals() method. Since the dataframes are identical (the values and dtypes of the elements are the same, and the row and column labels are the same), True is returned.

import pandas as pd
import numpy as np

# two identical dataframes
df1 = pd.DataFrame({'A': [1,np.nan], 'B': ['x', None]})
df2 = pd.DataFrame({'A': [1,np.nan], 'B': ['x', None]})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
     A     B
0  1.0     x
1  NaN  None

DataFrame df2:
     A     B
0  1.0     x
1  NaN  None

Are both equal?
True

In the above example, you can see that missing values such as NaN and None are treated as equal when they occur at the same location in both dataframes.
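To see why this NaN handling matters, here is a minimal sketch contrasting equals() with an element-wise comparison using the == operator. The names left, right, elementwise_equal and equals_result are illustrative and not part of the original examples, and the sketch uses only a float column to keep the behavior unambiguous.

```python
import numpy as np
import pandas as pd

# two dataframes with a NaN in the same location (illustrative names)
left = pd.DataFrame({'A': [1.0, np.nan]})
right = pd.DataFrame({'A': [1.0, np.nan]})

# element-wise comparison: NaN == NaN evaluates to False
elementwise_equal = bool((left == right).all().all())

# equals() treats NaNs in the same location as equal
equals_result = left.equals(right)

print("Element-wise (==) all equal?", elementwise_equal)
print("equals() result:", equals_result)
```

Reducing an element-wise comparison with all() reports a mismatch whenever a NaN is present, which is one reason to prefer equals() when the data may contain missing values.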

import pandas as pd
import numpy as np

# two dataframes with equal values but different dtypes in column A
df1 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1.0,2.0], 'B': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
     A  B
0  1.0  x
1  2.0  y

Are both equal?
False

In the above example, column A has equal values but different dtypes in df1 and df2, hence we get False. For the dataframes to be equal, the elements must have the same values and the same dtypes.
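If a dtype difference like this is not meaningful for your use case, one option is to cast the columns to a common dtype before comparing. The following is a minimal sketch under that assumption. The names df_int, df_float and df_float_cast are illustrative, and casting is a choice made for this sketch, not something equals() does for you.

```python
import pandas as pd

# same values, but column A is int in one dataframe and float in the other
df_int = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df_float = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})

# direct comparison fails because of the dtype mismatch in column A
before_cast = df_int.equals(df_float)

# cast column A of df_float to the dtype used in df_int, then compare again
df_float_cast = df_float.astype({'A': df_int['A'].dtype})
after_cast = df_int.equals(df_float_cast)

print("Equal before cast?", before_cast)
print("Equal after cast?", after_cast)
```

Note that casting floats to integers discards any fractional part, so this is only appropriate when the values are known to be whole numbers.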

Will the dataframes be equal if the column names are equal in value but have different dtypes, given that the elements are the same?

import pandas as pd
import numpy as np

# two dataframes whose first column label differs in type (1 vs 1.0)
df1 = pd.DataFrame({1: [1,2], 'B': ['x', 'y']})
df2 = pd.DataFrame({1.0: [1,2], 'B': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
   1  B
0  1  x
1  2  y

DataFrame df2:
   1.0  B
0    1  x
1    2  y

Are both equal?
True

In the above example, we find that the dtypes of the column names do not matter so long as the names themselves compare equal.

What will the equals() function return if two dataframes have the same elements but different column names?

import pandas as pd
import numpy as np

# two dataframes with the same elements but different column names
df1 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'C': [1,2], 'D': ['x', 'y']})

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
   C  D
0  1  x
1  2  y

Are both equal?
False

In the above example, we see that the elements of the dataframes df1 and df2 are the same, but since the column names are different, the two dataframes are not considered equal.
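If you only care about the elements and know how the columns correspond, you could rename the columns of one dataframe to match the other before comparing. This is a minimal sketch. The names df_ab, df_cd and df_cd_renamed, as well as the C-to-A and D-to-B mapping, are assumptions made for illustration.

```python
import pandas as pd

# same elements, different column names
df_ab = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df_cd = pd.DataFrame({'C': [1, 2], 'D': ['x', 'y']})

# direct comparison fails because the column names differ
before_rename = df_ab.equals(df_cd)

# rename the columns of df_cd to match df_ab (mapping assumed for this sketch)
df_cd_renamed = df_cd.rename(columns={'C': 'A', 'D': 'B'})
after_rename = df_ab.equals(df_cd_renamed)

print("Equal before rename?", before_rename)
print("Equal after rename?", after_rename)
```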

import pandas as pd
import numpy as np

# two dataframes with the same elements and column names
df1 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [1,2], 'B': ['x', 'y']})
# change the index of df2
df2.index = ['i', 'j']

# print the two dataframes
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# check if both are equal
print("\nAre both equal?")
print(df1.equals(df2))

Output:

DataFrame df1:
   A  B
0  1  x
1  2  y

DataFrame df2:
   A  B
i  1  x
j  2  y

Are both equal?
False

In the above example, we can see that, as was the case with column names, dataframes having different indices are not considered equal even if they have the same elements. If you want to compare two dataframes with different index schemes, first reset the index and then check for equality.
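The reset-then-compare approach can be sketched as follows, using reset_index() with drop=True so that the old index is discarded and not added as a column. The names df_default and df_labeled are illustrative.

```python
import pandas as pd

# same elements and column names, different row labels
df_default = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df_labeled = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']}, index=['i', 'j'])

# direct comparison fails because the indices differ
before_reset = df_default.equals(df_labeled)

# reset both indices to the default integer index and compare again
after_reset = df_default.reset_index(drop=True).equals(
    df_labeled.reset_index(drop=True)
)

print("Equal before reset?", before_reset)
print("Equal after reset?", after_reset)
```

Resetting the index compares rows by position, so this only makes sense when the rows of both dataframes are in the same order.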

For more on the pandas dataframe equals() function, refer to its official documentation.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel, using pandas version 1.0.5 and numpy version 1.18.5.

