difference between pandas join and merge

Pandas – Join vs Merge

In this tutorial, we’ll look at the difference between pandas join() and merge() functions and when exactly should you use them. If you’re looking for a refresher on the different types of joins, you can refer to Understanding Joins in Pandas

Both the functions are used to perform joins on pandas dataframes but they’re used in different scenarios. The join() function is generally used to join dataframes on index whereas the merge() function is a more versatile function that lets you join dataframes on indexes as well as columns.

You can, however, use the join() function to join the left dataframe on a column but the right dataframe must be joined on its index. So, if you want to perform database-style joins on dataframes use the more versatile merge() function.

To join dataframes using the join() function, use the following syntax:

df_merged = df_left.join(df_right, how='left')

Note that, the join() function by default performs a join on index. If you want to join the left dataframe on a column, pass the column name to the on parameter. But remember, the right dataframe must be joined on its index. Also note, that the join() function performs a left join by default.

To join dataframes using the merge() function, use the following syntax:

df_merged = df_left.merge(df_right, on='Col1', how='inner')

Here, the on parameter is used to pass the column to merge the dataframes on and the how parameter is used to determine the kind of join to perform. The merge() function performs an inner join by default.

Let’s look at some example use-cases to illustrate the difference between the two.

📚 Data Science Programs By Skill Level

Introductory

Intermediate ⭐⭐⭐

Advanced ⭐⭐⭐⭐⭐

🔎 Find Data Science Programs 👨‍💻 111,889 already enrolled

Disclaimer: Data Science Parichay is reader supported. When you purchase a course through a link on this site, we may earn a small commission at no additional cost to you. Earned commissions help support this website and its team of writers.

Let’s merge two dataframes on their indexes using join() and merge(). For this, we’ll create two dataframes “df_names” and “df_portfolio”. Both having a common column “Symbol”.

import pandas as pd

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100]
})

The two dataframes look as follows:

two dataframes to merge

Now, let’s set the “Symbol” column as the index for both the dataframes.

df_names.set_index('Symbol', inplace=True)
df_portfolio.set_index('Symbol', inplace=True)

This is how the dataframes look after setting the index:

Left and right dataframes with symbol column as axis

Left Join using pandas join()

Now, let’s perform a left join on the two dataframes by merging them on their indexes using the join() function.

df_merged = df_names.join(df_portfolio, how='left')
print(df_merged)

Output:

             Name  Shares
Symbol                   
APPL        Apple    50.0
MSFT    Mircosoft     NaN
TSLA        Tesla    20.0
GOOG       Google    50.0
NFLX      Netflix     NaN

You can see the resulting dataframe with NaNs where a match was not found in the right dataframe, df_portfolio.

Left Join using pandas merge()

Let’s perform the same left join on index as above but this time using the merge() function.

df_merged = df_names.merge(df_portfolio, left_index=True, right_index=True, how='left')
print(df_merged)

Output:

             Name  Shares
Symbol                   
APPL        Apple    50.0
MSFT    Mircosoft     NaN
TSLA        Tesla    20.0
GOOG       Google    50.0
NFLX      Netflix     NaN

You can see that we get the same result as we got with the join() function. Note that, the merge() function required passing values for parameters left_index and right_index to specify that we wanted to perform a merge on the indexes.

Now, let’s merge the left dataframe on the column “Symbol” and the right dataframe on its index. First, we’ll recreate the two dataframes and then set the “Symbol” column as the index only for the right dataframe.

import pandas as pd

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100]
})

# set Symbol as axis for df_names
df_portfolio.set_index('Symbol', inplace=True)

This is how the two dataframes look:

Left Join using join()

df_merged = df_names.join(df_portfolio, on='Symbol', how='left')
print(df_merged)

Output:

        Name Symbol  Shares
0      Apple   APPL    50.0
1  Mircosoft   MSFT     NaN
2      Tesla   TSLA    20.0
3     Google   GOOG    50.0
4    Netflix   NFLX     NaN

You can see that we were able to merge the left dataframe on the column “Symbol” and the right datafrome on its index using the pandas join() function.

Left Join using merge()

df_merged = df_names.merge(df_portfolio, left_on='Symbol', right_index=True, how='left')
print(df_merged)

Output:

        Name Symbol  Shares
0      Apple   APPL    50.0
1  Mircosoft   MSFT     NaN
2      Tesla   TSLA    20.0
3     Google   GOOG    50.0
4    Netflix   NFLX     NaN

The pandas merge() function was able to merge the left dataframe on the column “Symbol” and the right one on its index. Note that, we had to pass right_index=True to indicate that the right dataframe should be merged on its index.

So far so good.

Now, let’s merge the left dataframe on its index and the right dataframe on the column “Symbol”. First, we’ll recreate the two dataframes and then set the “Symbol” column as the index only for the left dataframe.

import pandas as pd

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100]
})

# set Symbol as axis for df_names
df_names.set_index('Symbol', inplace=True)

This is how the two dataframes look:

Left and right dataframes before join.

Left Join using join()

If you try to join the left dataframe on index and the right dataframe on a column using join(), it’ll result in an error.

df_merged = df_names.join(df_portfolio, on='Symbol', how='left')
print(df_merged)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
----> 1 df_merged = df_names.join(df_portfolio, on='Symbol', how='left')
      2 print(df_merged)

** lines skipped to keep the error message short **

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

This happened because the join() function requires the right dataframe to be merged on its index.

Left Join using merge()

The merge() function allows you the flexibility to merge the left dataframe on index and the right one on a column.

df_merged = df_names.merge(df_portfolio, left_index=True, right_on='Symbol', how='left')
print(df_merged)

Output:

          Name Symbol  Shares
1.0      Apple   APPL    50.0
NaN  Mircosoft   MSFT     NaN
0.0      Tesla   TSLA    20.0
2.0     Google   GOOG    50.0
NaN    Netflix   NFLX     NaN

Note that, we had to pass left_index=True to indicate that the left dataframe should be merged on its index.

Now, let’s merge the two dataframes directly on the common column “Symbol”.

import pandas as pd

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100]
})

This is how the dataframes look:

two dataframes to merge

Left join using join()

The pandas join() function results in an error if you try to merge the two dataframes on a common column.

df_merged = df_names.join(df_portfolio, on='Symbol', how='left')
print(df_merged)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
----> 1 df_merged = df_names.join(df_portfolio, on='Symbol', how='left')
      2 print(df_merged)

** lines skipped to keep the error message short **

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Left Join using pandas merge

If the column to merge on has the same name in the two dataframes, pass it to the on parameter. If they have different names, pass the respective names to left_on and right_on parameters.

df_merged = df_names.merge(df_portfolio, on='Symbol', how='left')
print(df_merged)

Output:

        Name Symbol  Shares
0      Apple   APPL    50.0
1  Mircosoft   MSFT     NaN
2      Tesla   TSLA    20.0
3     Google   GOOG    50.0
4    Netflix   NFLX     NaN

The pandas merge() and join() functions are used to join dataframes but have different use-cases. If you want to perform database-style joins on dataframes prefer the merge() function as it gives you more flexibility on how to perform the merge. If you, instead, want to join dataframes specifically on their indexes use the join() function.

The following table compares the two functions on the different combinations of joining dataframe on column or index.

DataFrame Mergejoin()merge()
On index✔️✔️
Left on column, right on index✔️✔️
Left on index, right on column✔️
On column✔️

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having pandas version 1.0.5


Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.


Tutorials on combining data in pandas –

References

  1. Pandas merge() function documentation.
  2. Pandas join() function documentation.

Author

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

Scroll to Top