Generally, when working with databases, it might happen that the information you’re looking for is spread across different tables. In such cases, you’d want to join those tables based on certain common keys to get the desired outcome. You can similarly join different dataframes together in pandas. In this tutorial, we’ll look at the different types of joins that you should be aware of and their implementations in pandas.
Overview
Before we proceed, here’s a quick overview of what will be covered in this tutorial.
- Understanding joins – Inner, Left, Right, and Outer.
- Implementing joins on pandas dataframes.
- Difference between pandas
merge()
,join()
, andconcat()
functions.
Understanding Joins
Joining is one of the most common operations when working with tables. As mentioned above, the data you require, may not necessarily be available as a single table. In which case, you’d be required to join relevant tables to get what you’re looking for. But, there are multiple ways to join tables together. You could do a left join, a right join, an inner join, an outer join, etc.
The choice of which join to use depends upon your specific use-case and it’s important to know what each of these represents so that the next time you’re in a situation to join tables, you know just the one join to use. Let’s have a better look at the four most used join operations –
1. Inner Join
An inner join is used to join tables keeping only the rows with common keys. The following diagrams illustrates an inner join applied on two tables.
In the above figure, green represents the rows included in the result while red represents the rows that are excluded. In an inner join only the rows that share common keys are kept from both the tables with the rest of the rows excluded in the results.
2. Left Join
In a left join (also known as left outer join), all the rows from the left table are included along with matching rows from the right table. For the left table rows that do not have a corresponding match in the right table, the result will include Null values (NaN if implementing join in pandas).
The above figure demonstrates a left join operation. The green section represents the rows that are included and the red section represents the rows that excluded in the result. The yellow section represents NaNs where the left table rows do not have a match in the right table.
Introductory ⭐
- Harvard University Data Science: Learn R Basics for Data Science
- Standford University Data Science: Introduction to Machine Learning
- UC Davis Data Science: Learn SQL Basics for Data Science
- IBM Data Science: Professional Certificate in Data Science
- IBM Data Analysis: Professional Certificate in Data Analytics
- Google Data Analysis: Professional Certificate in Data Analytics
- IBM Data Science: Professional Certificate in Python Data Science
- IBM Data Engineering Fundamentals: Python Basics for Data Science
Intermediate ⭐⭐⭐
- Harvard University Learning Python for Data Science: Introduction to Data Science with Python
- Harvard University Computer Science Courses: Using Python for Research
- IBM Python Data Science: Visualizing Data with Python
- DeepLearning.AI Data Science and Machine Learning: Deep Learning Specialization
Advanced ⭐⭐⭐⭐⭐
- UC San Diego Data Science: Python for Data Science
- UC San Diego Data Science: Probability and Statistics in Data Science using Python
- Google Data Analysis: Professional Certificate in Advanced Data Analytics
- MIT Statistics and Data Science: Machine Learning with Python - from Linear Models to Deep Learning
- MIT Statistics and Data Science: MicroMasters® Program in Statistics and Data Science
🔎 Find Data Science Programs 👨💻 111,889 already enrolled
Disclaimer: Data Science Parichay is reader supported. When you purchase a course through a link on this site, we may earn a small commission at no additional cost to you. Earned commissions help support this website and its team of writers.
3. Right Join
In a right join (also known as right outer join), all the rows from the right table are included along with matching rows from the left table. For the right table rows that do not have a corresponding match in the left table, the result will include Null values (NaN if implementing join in pandas).
The above figure demonstrates a right join operation. The yellow section represents NaNs where the right table rows do not have a match in the left table.
4. Outer Join
In an outer join (also called a full outer join) all the rows from both the tables are included. For the rows that do not have a matching key in either of the tables, the result will include Null values (NaN if implementing join in pandas).
The above figure demonstrates an outer join operation. Note that, no rows have been dropped.
Implementing Joins in Pandas
Now that we have an understanding of what different joins do, let’s look at their implementation in pandas by joining dataframes. For the purpose of this tutorial, let’s create two dataframes representing tables that we want to join.
import pandas as pd
# stock names
df_names = pd.DataFrame({
'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX']
})
# sample porfolio
df_portfolio = pd.DataFrame({
'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
'Shares': [20, 50, 50, 100]
})
The dataframe df_names
stores some stock names and their symbols while the dataframe df_portfolio
stores the stock counts in a sample portfolio.
To implement a database like join in pandas, use the pandas merge()
function. The following is the syntax:
df_merged = pd.merge(df_left, df_right, on='Key', how='inner')
Here, we pass the left and right dataframes to be joined. The on
parameter takes the merge key. If you have keys with different names you can pass them to left_on
and right_on
respectively. The how
parameter determines the type of join to perform. It is 'inner'
by default.
Let’s look the code examples of each of the above joins discussed.
1. Inner Join in Pandas
To implement an inner join in pandas with the merge()
function, pass 'inner'
to the how
parameter.
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='inner')
print(df_merged)
Output:
Name Symbol Shares
0 Apple APPL 50
1 Tesla TSLA 20
2 Google GOOG 50
Only the rows that had common values of the column “Symbol” across the two dataframes were included in the result of inner join.
2. Left Join in Pandas
To implement a left join in pandas with the merge()
function, pass 'left'
to the how
parameter.
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='left')
print(df_merged)
Output:
Name Symbol Shares
0 Apple APPL 50.0
1 Mircosoft MSFT NaN
2 Tesla TSLA 20.0
3 Google GOOG 50.0
4 Netflix NFLX NaN
You can see that all the rows from the left dataframe df_names
were included along with matching values of Shares from the right dataframe df_portfolio
. Also, note that for the rows where a match wasn’t present in the right dataframe we get NaN values for the “Shares” column.
3. Right Join in Pandas
To implement a right join in pandas with the merge()
function, pass 'right'
to the how
parameter.
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='right')
print(df_merged)
Output:
Name Symbol Shares
0 Apple APPL 50
1 Tesla TSLA 20
2 Google GOOG 50
3 NaN AMZN 100
You can see that all the rows from the right dataframe df_portfolio
were included along with matching values of Name from the left dataframe df_names
. Also, note that for the rows where a match wasn’t present in the left dataframe we get NaN values for the “Name” column.
4. Outer Join in Pandas
To implement an outer join in pandas with the merge()
function, pass 'outer'
to the how
parameter.
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='outer')
print(df_merged)
Output:
Name Symbol Shares
0 Apple APPL 50.0
1 Mircosoft MSFT NaN
2 Tesla TSLA 20.0
3 Google GOOG 50.0
4 Netflix NFLX NaN
5 NaN AMZN 100.0
The resulting dataframe from the outer join included all the records from both the dataframes with NaNs where a match wasn’t found in either of the dataframes.
Other useful parameters of merge()
The pandas merge()
function comes with a number of additional useful parameters to help you better join the dataframes.
Join on Index
For example, if you want to join dataframes on indexes, you can use the left_index
and/or right_index
parameters which are False
by default.
# set the symbol column as index for df_names
df_names.set_index('Symbol', inplace=True)
# merge df_names on index and df_portfolio on 'Symbol' column
df_merged = pd.merge(df_names, df_portfolio, left_index=True, right_on='Symbol', how='inner')
print(df_merged)
Output:
Name Symbol Shares
1 Apple APPL 50
0 Tesla TSLA 20
2 Google GOOG 50
Here, we performed an inner join between the two dataframes using the index of the left and the “Symbol” column of the right dataframe.
Use suffixes to distinguish columns with the same names
You can use the suffixes
parameter to give custom suffixes (of length 2) to columns having same names from both the dataframes to avoid confusion post merging. For example, if the above two dataframes have an additional column with the name “Industry”
# stock names
df_names = pd.DataFrame({
'Name': ['Apple', 'Mircosoft', 'Tesla', 'Google', 'Netflix'],
'Symbol': ['APPL', 'MSFT', 'TSLA', 'GOOG', 'NFLX'],
'Industry': ['Tech', 'Tech', 'Auto', 'Tech', 'Content']
})
# sample porfolio
df_portfolio = pd.DataFrame({
'Symbol': ['TSLA', 'APPL', 'GOOG', 'AMZN'],
'Shares': [20, 50, 50, 100],
'Industry': ['Technology', 'Technology', 'Technology', 'Technology']
})
# merge with custom suffixes for common columns
df_merged = pd.merge(df_names, df_portfolio, on='Symbol', how='left', suffixes=('_n', '_p'))
print(df_merged)
Output:
Name Symbol Industry_n Shares Industry_p
0 Apple APPL Tech 50.0 Technology
1 Mircosoft MSFT Tech NaN NaN
2 Tesla TSLA Auto 20.0 Technology
3 Google GOOG Tech 50.0 Technology
4 Netflix NFLX Content NaN NaN
Here, both the dataframes have a column named “Industry”, we provided suffixes of length two as a tuple to the suffixes
parameter. Hence, the “Industry” column from the left dataframe has the suffix “_n” and the one from the right dataframe has the suffix “_p”.
There are other useful parameters as well to the merge()
function. For a complete list, refer to its documentation.
Difference Between pandas merge()
, join()
, and concat()
In the above examples, we saw the implementation of joining dataframes using the pandas merge()
function. There are, however, other pandas functions as well like join()
and concat()
each having its specific usage in joining/combining dataframes. Let’s look at the differences between them and when to use each.
join()
: The pandasjoin()
function can also be used to join dataframes but it is primarily used to join them on their indexes. You can, however, join the left dataframe on a column but the right dataframe must be joined on its index.concat()
: The pandasconcat()
function is used to concatenate rows or columns to a dataframe without looking at the values. That is, you do not match against a key when concatenating dataframes. For more, refer to our guide on dataframe concatenationmerge()
: The pandasmerge()
function is a very versatile function to implement joins. It offers a number of different options to customize your join operation. It’s mostly like the one you’d want to use if you want to do a database like join with pandas dataframes.
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having pandas version 1.0.5
Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.