In this tutorial, we’ll look at how to plot histograms by group in pandas with the help of some examples.
Plotting histograms using grouped data from a pandas DataFrame creates one histogram for each group in the DataFrame. For example, you group the data by values of column 1 and then show the distribution of values in column 2 for each group of data points using a histogram.
You can use the following methods to plot histograms by group in pandas:
- Plot Histograms by Group Using Multiple Plots – one histogram for each group
- Plot Histograms by Group Using One Plot – all the histograms on a single plot
Let’s now look at both methods in detail.
Method 1 – Plot Histograms by Group Using Multiple Plots
You can use the pandas.DataFrame.hist() method to create histograms for different groups of data. Each group is plotted on a separate subplot.
Use the by parameter to specify the column to group the data on, and the column parameter to specify the column whose distribution you want to plot. You can also call hist() directly on an individual column of the dataframe (a pandas Series) and pass only the column(s) to group the data on.
The following is the syntax –
Basic Syntax:
DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)
Parameters:
- data – The pandas object holding the data. When hist() is called as a method, this is the dataframe it is called on.
- column – If passed, will be used to limit data to a subset of columns.
- by – If passed, then used to form histograms for separate groups.
For more details about the arguments, refer to the pandas documentation for DataFrame.hist().
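As a minimal sketch of the DataFrame-level form described above, the call below passes both column and by. The seeded random generator and the names rng, axes and titles are illustrative assumptions chosen for reproducibility.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): seeded so the sketch is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
    'col2': rng.normal(loc=10, scale=2, size=200),
})

# DataFrame-level call: `column` selects the values to plot,
# `by` selects the column that defines the groups.
axes = df.hist(column='col2', by='col1')

# One subplot is created per group, titled with the group label.
titles = sorted(ax.get_title() for ax in np.ravel(axes))
print(titles)
```

In a script, call matplotlib.pyplot.show() afterwards to display the figure; in a notebook the subplots are rendered automatically.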
Now let us understand the above method with some worked out examples.
Example 1 – Plot histogram of column values by group in a pandas dataframe on separate plots
Let’s create a pandas dataframe with two columns – “col1”, a column storing categorical data which will be used to group the data, and “col2”, a column with numerical data.
Then, plot the distribution of the values in “col2” for each group (determined by the “col1” values) using the pandas hist() function.
import pandas as pd
import numpy as np

# create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

# plot the histogram by group in multiple plots
df['col2'].hist(by=df['col1'])
Output: (figure: a grid of four histograms of “col2”, one subplot for each group W, X, Y and Z)
In the above example, we –
- Import the required modules.
- Create a dataframe whose first column holds the values W, X, Y, Z, each repeated 50 times, and whose second column holds numerical values generated using numpy.random.normal().
- Plot the histogram by group in multiple plots using the pandas hist() function.
Example 2 – Histogram by group in multiple plots with customizations
You can customize the resulting plots by passing additional parameters to the pandas hist() function. For example, let’s change the edge color of the histogram bars to red and set the figure size.
import pandas as pd
import numpy as np

# create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

# plot the histogram by group in multiple plots with red bar edges
df['col2'].hist(by=df['col1'], edgecolor='red', figsize=(8, 6))
Output:
The histogram bars now have a red edge.
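Other parameters listed in the syntax above can be combined in the same way. The sketch below assumes a seeded sample dataset and uses the DataFrame-level call to set bins, layout, sharex and sharey, so that the four groups sit side by side on common axes and are easier to compare. The specific values (15 bins, a 1×4 layout) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): seeded so the sketch is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
    'col2': rng.normal(loc=10, scale=2, size=200),
})

axes = df.hist(
    column='col2',
    by='col1',
    bins=15,            # number of bins in every subplot
    layout=(1, 4),      # one row, four columns of subplots
    sharex=True,        # same x-axis range across groups
    sharey=True,        # same y-axis range across groups
    figsize=(12, 3),
    edgecolor='red',    # forwarded to matplotlib's hist via **kwargs
)

n_subplots = len(np.ravel(axes))
print(n_subplots)
```

Sharing both axes makes differences in location and spread between groups visible at a glance, at the cost of some resolution for groups with a narrow range.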
Method 2 – Plot Histograms by Group in One Plot
You can use the matplotlib.pyplot.hist() function to plot the histograms of groups of data in a single plot. This type of histogram shows the level of overlap in the distribution of the values across different groups.
The following is the syntax –
Basic Syntax:
matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)
Parameters:
- x: Input values, this takes either a single array or a sequence of arrays that are not required to be of the same length.
- range: The lower and upper range of the bins. Lower and upper outliers are ignored. If not provided, range is (x.min(), x.max()). Range has no effect if bins is a sequence.
For more details about the arguments, refer to the matplotlib documentation for matplotlib.pyplot.hist().
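Because x accepts a sequence of arrays, one way to put several groups on a single plot is to pass them all in one call. The sketch below is an illustration under assumed data (a seeded sample and the hypothetical names labels and groups). In this form matplotlib draws the bars of each group next to each other within every bin instead of overlaying them.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data (assumption): seeded so the sketch is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
    'col2': rng.normal(loc=10, scale=2, size=200),
})

# One array of col2 values per group, in a fixed label order.
labels = ['W', 'X', 'Y', 'Z']
groups = [df.loc[df['col1'] == label, 'col2'].to_numpy() for label in labels]

# A single call: all groups share the same bin edges.
counts, bin_edges, _ = plt.hist(groups, bins=10, label=labels)
plt.legend(title='col1')

print(len(counts), len(bin_edges))
```

Since range is not given, the bins span the minimum and maximum over all groups, so every value is counted.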
Now let us understand the usage of this method with an example.
We’ll take the same dataframe as above – a dataframe with two columns – “col1”, a column storing categorical data which will be used to group the data, and “col2”, a column with numerical data.
Now, the matplotlib.pyplot.hist() function, by itself, cannot group the data. So we’ll have to group the data separately, and then plot the histogram for each group on the same plot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

# select the col2 values for each group
W = df.loc[df['col1'] == 'W', 'col2']
X = df.loc[df['col1'] == 'X', 'col2']
Y = df.loc[df['col1'] == 'Y', 'col2']
Z = df.loc[df['col1'] == 'Z', 'col2']

# add four histograms to one plot
plt.hist(W, alpha=0.5, label='W')
plt.hist(X, alpha=0.5, label='X')
plt.hist(Y, alpha=0.5, label='Y')
plt.hist(Z, alpha=0.5, label='Z')

# the legend entries are the col1 group labels
plt.legend(title='col1')
plt.show()
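Output: (figure: overlapping semi-transparent histograms of “col2” for each group on a single plot, with a legend)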
In the above example, we –
- Import the required modules.
- Create a dataframe whose first column holds the values W, X, Y, Z, each repeated 50 times, and whose second column holds numerical values generated using numpy.random.normal().
- Group the data based on the values in “col1”.
- Plot the histogram for each group in the same plot using the matplotlib.pyplot.hist() function. Note that we use the alpha parameter to make the histograms semi-transparent so that we can easily see the overlap.
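The per-group selection above can also be written as a loop over a pandas groupby, which avoids naming each group by hand. The following is a sketch under assumptions, not the only way to do it: the data is a seeded sample, and the shared bin edges computed with numpy.histogram_bin_edges are an addition so that bars from different groups line up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data (assumption): seeded so the sketch is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'col1': np.repeat(['W', 'X', 'Y', 'Z'], 50),
    'col2': rng.normal(loc=10, scale=2, size=200),
})

# Common bin edges computed from all values, so every group uses the same bins.
bin_edges = np.histogram_bin_edges(df['col2'], bins=15)

fig, ax = plt.subplots()
# Iterating over the grouped column yields (group label, values) pairs.
for label, values in df.groupby('col1')['col2']:
    ax.hist(values, bins=bin_edges, alpha=0.5, label=label)
ax.legend(title='col1')

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Without shared edges, each call to hist() picks its own bins from that group's range, which can make overlapping bars harder to compare.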