Skip to Content

Understanding Joins in Pandas

Generally, when working with databases, it might happen that the information you’re looking for is spread across different tables. In such cases, you’d want to join those tables based on certain common keys to get the desired outcome. You can similarly join different dataframes together in pandas. In this tutorial, we’ll look at the different types of joins that you should be aware of and their implementations in pandas.

Before we proceed, here’s a quick overview of what will be covered in this tutorial.

  • Understanding joins – Inner, Left, Right, and Outer.
  • Implementing joins on pandas dataframes.
  • Difference between pandas merge(), join(), and concat() functions.

Joining is one of the most common operations when working with tables. As mentioned above, the data you require, may not necessarily be available as a single table. In which case, you’d be required to join relevant tables to get what you’re looking for. But, there are multiple ways to join tables together. You could do a left join, a right join, an inner join, an outer join, etc.

The choice of which join to use depends upon your specific use-case and it’s important to know what each of these represents so that the next time you’re in a situation to join tables, you know just the one join to use. Let’s have a better look at the four most used join operations –

An inner join is used to join tables keeping only the rows with common keys. The following diagrams illustrates an inner join applied on two tables.

Figure showing an inner join between tables through rectangles

In the above figure, green represents the rows included in the result while red represents the rows that are excluded. In an inner join only the rows that share common keys are kept from both the tables with the rest of the rows excluded in the results.

In a left join (also known as left outer join), all the rows from the left table are included along with matching rows from the right table. For the left table rows that do not have a corresponding match in the right table, the result will include Null values (NaN if implementing join in pandas).

Left Join operation represented by rectangles

The above figure demonstrates a left join operation. The green section represents the rows that are included and the red section represents the rows that excluded in the result. The yellow section represents NaNs where the left table rows do not have a match in the right table.

In a right join (also known as right outer join), all the rows from the right table are included along with matching rows from the left table. For the right table rows that do not have a corresponding match in the left table, the result will include Null values (NaN if implementing join in pandas).

Right Join Operation represented by rectangles

The above figure demonstrates a right join operation. The yellow section represents NaNs where the right table rows do not have a match in the left table.

In an outer join (also called a full outer join) all the rows from both the tables are included. For the rows that do not have a matching key in either of the tables, the result will include Null values (NaN if implementing join in pandas).

Outer Join operation represented by rectangles

The above figure demonstrates an outer join operation. Note that, no rows have been dropped.

Now that we have an understanding of what different joins do, let’s look at their implementation in pandas by joining dataframes. For the purpose of this tutorial, let’s create two dataframes representing tables that we want to join.

import pandas as pd

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100]
})
Left and right dataframe

The dataframe df_names stores some stock names and their symbols while the dataframe df_portfolio stores the stock counts in a sample portfolio.

To implement a database like join in pandas, use the pandas merge() function. The following is the syntax:

df_merged = pd.merge(df_left, df_right, on='Key', how='inner')

Here, we pass the left and right dataframes to be joined. The on parameter takes the merge key. If you have keys with different names you can pass them to left_on and right_on respectively. The how parameter determines the type of join to perform. It is 'inner' by default.

Let’s look the code examples of each of the above joins discussed.

To implement an inner join in pandas with the merge() function, pass 'inner' to the how parameter.

df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='inner')
print(df_merged)

Output:

     Name Symbol  Shares
0   Apple   APPL      50
1   Tesla   TSLA      20
2  Google   GOOG      50

Only the rows that had common values of the column “Symbol” across the two dataframes were included in the result of inner join.

To implement a left join in pandas with the merge() function, pass 'left' to the how parameter.

df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='left')
print(df_merged)

Output:

        Name Symbol  Shares
0      Apple   APPL    50.0
1  Mircosoft   MSFT     NaN
2      Tesla   TSLA    20.0
3     Google   GOOG    50.0
4    Netflix   NFLX     NaN

You can see that all the rows from the left dataframe df_names were included along with matching values of Shares from the right dataframe df_portfolio. Also, note that for the rows where a match wasn’t present in the right dataframe we get NaN values for the “Shares” column.

To implement a right join in pandas with the merge() function, pass 'right' to the how parameter.

df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='right')
print(df_merged)

Output:

     Name Symbol  Shares
0   Apple   APPL      50
1   Tesla   TSLA      20
2  Google   GOOG      50
3     NaN   AMZN     100

You can see that all the rows from the right dataframe df_portfolio were included along with matching values of Name from the left dataframe df_names. Also, note that for the rows where a match wasn’t present in the left dataframe we get NaN values for the “Name” column.

To implement an outer join in pandas with the merge() function, pass 'outer' to the how parameter.

df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='outer')
print(df_merged)

Output:

        Name Symbol  Shares
0      Apple   APPL    50.0
1  Mircosoft   MSFT     NaN
2      Tesla   TSLA    20.0
3     Google   GOOG    50.0
4    Netflix   NFLX     NaN
5        NaN   AMZN   100.0

The resulting dataframe from the outer join included all the records from both the dataframes with NaNs where a match wasn’t found in either of the dataframes.

The pandas merge() function comes with a number of additional useful parameters to help you better join the dataframes.

Join on Index
For example, if you want to join dataframes on indexes, you can use the left_index and/or right_index parameters which are False by default.

# set the symbol column as index for df_names
df_names.set_index('Symbol', inplace=True)

# merge df_names on index and df_portfolio on 'Symbol' column
df_merged = pd.merge(df_names, df_portfolio, left_index=True, right_on='Symbol', how='inner')
print(df_merged)

Output:

     Name Symbol  Shares
1   Apple   APPL      50
0   Tesla   TSLA      20
2  Google   GOOG      50

Here, we performed an inner join between the two dataframes using the index of the left and the “Symbol” column of the right dataframe.

Use suffixes to distinguish columns with the same names
You can use the suffixes parameter to give custom suffixes (of length 2) to columns having same names from both the dataframes to avoid confusion post merging. For example, if the above two dataframes have an additional column with the name “Industry”

# stock names
df_names = pd.DataFrame({
    'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
    'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX'],
    'Industry': ['Tech', 'Tech', 'Auto', 'Tech', 'Content']
})

# sample porfolio
df_portfolio = pd.DataFrame({
    'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
    'Shares': [20, 50, 50, 100],
    'Industry': ['Technology', 'Technology', 'Technology', 'Technology']
})

# merge with custom suffixes for common columns
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='left', suffixes=('_n', '_p'))
print(df_merged)

Output:

        Name Symbol Industry_n  Shares  Industry_p
0      Apple   APPL       Tech    50.0  Technology
1  Mircosoft   MSFT       Tech     NaN         NaN
2      Tesla   TSLA       Auto    20.0   Technology
3     Google   GOOG       Tech    50.0  Technology
4    Netflix   NFLX    Content     NaN         NaN

Here, both the dataframes have a column named “Industry”, we provided suffixes of length two as a tuple to the suffixes parameter. Hence, the “Industry” column from the left dataframe has the suffix “_n” and the one from the right dataframe has the suffix “_p”.

There are other useful parameters as well to the merge() function. For a complete list, refer to its documentation.

In the above examples, we saw the implementation of joining dataframes using the pandas merge() function. There are, however, other pandas functions as well like join() and concat() each having its specific usage in joining/combining dataframes. Let’s look at the differences between them and when to use each.

  1. join(): The pandas join() function can also be used to join dataframes but it is primarily used to join them on their indexes. You can, however, join the left dataframe on a column but the right dataframe must be joined on its index.
  2. concat(): The pandas concat() function is used to concatenate rows or columns to a dataframe without looking at the values. That is, you do not match against a key when concatenating dataframes. For more, refer to our guide on dataframe concatenation
  3. merge(): The pandas merge() function is a very versatile function to implement joins. It offers a number of different options to customize your join operation. It’s mostly like the one you’d want to use if you want to do a database like join with pandas dataframes.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having pandas version 1.0.5


Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.


Author

  • Piyush is a data scientist passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.