How to Plot Histograms by Group in Pandas

In this tutorial, we’ll look at how to plot histograms by group in pandas with the help of some examples.

Plotting histograms using grouped data from a pandas DataFrame creates one histogram for each group in the DataFrame. For example, you group the data by values of column 1 and then show the distribution of values in column 2 for each group of data points using a histogram.

You can use the following methods to plot histograms by group in pandas:

  • Plot Histograms by Group Using Multiple Plots – one histogram for each group
  • Plot Histograms by Group Using One Plot – all the histograms on a single plot

Let’s now look at both methods in detail.

Method 1 – Plot Histograms by Group Using Multiple Plots

You can use the pandas.DataFrame.hist() method to create histograms for different groups of data. Each group is plotted on a separate subplot.

Use the by parameter to specify the column to group the data by, and the column parameter to specify the column whose distribution you want to plot. Alternatively, you can call hist() directly on an individual column of the dataframe (a pandas Series) and pass only the column(s) to group on via by.

The following is the basic syntax:

DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)

Parameters:

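  • data – The pandas object holding the data. When you call hist() as a DataFrame method, this is the dataframe itself, so you do not pass it explicitly.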
  • column – If passed, will be used to limit data to a subset of columns.
  • by – If passed, then used to form histograms for separate groups.

For more details about the arguments, refer to the pandas documentation for DataFrame.hist().
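
As a minimal sketch of the column and by parameters described above, the snippet below builds a small illustrative dataframe and calls hist() both on the dataframe and on a single column. The column names col1 and col2 and the seeded random generator are assumptions made for this sketch. Both calls should produce one subplot per group.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch): four groups of 50 values each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col1": np.repeat(["W", "X", "Y", "Z"], 50),
    "col2": rng.normal(loc=10, scale=2, size=200),
})

# Called on the dataframe: `column` picks the values, `by` picks the grouping column.
axes = df.hist(column="col2", by="col1", bins=10)

# Called on a single column (a Series): only the grouping values are passed via `by`.
axes_series = df["col2"].hist(by=df["col1"], bins=10)
```

Each call returns the matplotlib axes it drew on, which you can use for further tweaks such as setting titles or axis labels.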

Now let us understand the above method with some worked out examples.

Example 1 – Plot histogram of column values by group in a pandas dataframe on separate plots

Let’s create a pandas dataframe with two columns – “col1”, a column storing categorical data which will be used to group the data, and “col2”, a column with numerical data.

And then, plot the distribution of the values in “col2” for each group (decided by “col1” values) using the pandas hist() function.

import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W','X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

#Plotting the histogram by group in multiple plots
df['col2'].hist(by=df['col1'])

Output:

the resulting histograms - one for each group

In the above example, we –

  1. Import the required modules.
  2. Create a dataframe in which the first column holds the values W, X, Y, and Z, each repeated 50 times, and the second column holds numerical values drawn using numpy.random.normal.
  3. Plot the histogram by group in multiple plots using the pandas hist() function.

Example 2 – Histogram by group in multiple plots with customizations

You can customize the resulting plots by passing additional parameters to the pandas hist() function. For example, let’s change the edge color of the histogram bars to red and set the figure size using figsize.

import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W','X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

#Plotting the histogram by group in multiple plots
df['col2'].hist(by=df['col1'], edgecolor='red', figsize = (8,6)) 

Output:

the resulting histograms with red bar edges

The histogram bars now have a red edge.
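
Other parameters from the hist() signature can be combined in the same way. The following is a sketch, under the assumption of the same kind of dataframe as above (rebuilt here with a seeded generator so the snippet is self-contained), that places the four histograms in a single row and shares the axes so the panels are easier to compare.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch): same shape as the examples above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col1": np.repeat(["W", "X", "Y", "Z"], 50),
    "col2": rng.normal(loc=10, scale=2, size=200),
})

# layout=(rows, columns) arranges the subplots; sharex/sharey give every
# panel the same axis limits; bins sets the number of bars per histogram.
axes = df["col2"].hist(
    by=df["col1"],
    bins=15,
    layout=(1, 4),
    sharex=True,
    sharey=True,
    figsize=(12, 3),
    edgecolor="black",
)
```

Note that sharing the axes only aligns the axis limits. Each group is still binned over its own range of values.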

Method 2 – Plot Histograms by Group in One Plot

You can use the matplotlib.pyplot.hist() function to plot the histograms of groups of data in a single plot. This type of histogram shows the level of overlap in the distribution of the values across different groups.

The following is the basic syntax:

matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)

Parameters:

  • x: Input values, this takes either a single array or a sequence of arrays that are not required to be of the same length.
  • range: The lower and upper range of the bins. Lower and upper outliers are ignored. If not provided, range is (x.min(), x.max()). Range has no effect if bins is a sequence.

For more details about the arguments, refer to the matplotlib documentation for matplotlib.pyplot.hist().
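
To illustrate the x and range parameters described above, here is a small sketch that passes two arrays of different lengths in one call and restricts the bins to a fixed range. The array names a and b and the chosen range of (4, 16) are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two illustrative samples of different lengths (assumed for this sketch).
rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=50)
b = rng.normal(loc=12, scale=2, size=80)

# A sequence of arrays is drawn as side-by-side bars that share one set of bins.
# Values outside `range` are ignored when counting.
counts, edges, _ = plt.hist([a, b], bins=8, range=(4, 16), label=["a", "b"])
plt.legend()
```

Here counts holds one row of bin counts per input array, and edges holds the bin edges shared by both.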

Now let us understand the usage of this method with an example.

We’ll take the same dataframe as above – a dataframe with two columns – “col1”, a column storing categorical data which will be used to group the data, and “col2”, a column with numerical data.

Now, the matplotlib.pyplot.hist() function, by itself, cannot group the data. So we’ll have to group the data separately, and then plot the histogram for each group on the same plot.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#create DataFrame
df = pd.DataFrame({'col1': np.repeat(['W','X', 'Y', 'Z'], 50),
                   'col2': np.random.normal(loc=10, scale=2, size=200)})

#select the col2 values for each group
W = df.loc[df['col1'] == 'W', 'col2']
X = df.loc[df['col1'] == 'X', 'col2']
Y = df.loc[df['col1'] == 'Y', 'col2']
Z = df.loc[df['col1'] == 'Z', 'col2']

#add four histograms to one plot
plt.hist(W, alpha=0.5, label='W')
plt.hist(X, alpha=0.5, label='X')
plt.hist(Y, alpha=0.5, label='Y')
plt.hist(Z, alpha=0.5, label='Z')

plt.legend(title='col1')
plt.show()

Output:

all histograms on a single plot

In the above example, we –

  1. Import the required modules.
  2. Create a dataframe in which the first column holds the values W, X, Y, and Z, each repeated 50 times, and the second column holds numerical values drawn using numpy.random.normal.
  3. Select the “col2” values for each group based on the values in “col1”.
  4. Plot the histogram for each group in the same plot using the matplotlib.pyplot.hist() function. Note that we use the alpha parameter to make the histograms more transparent so that we can easily see the overlap.
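
The steps above select each group by hand. As an alternative sketch (not the approach used in the example above), you could loop over a pandas groupby and pass the same bin edges to every call so that the bars of different groups line up. The use of numpy.histogram_bin_edges and the seeded random generator are choices made for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data (assumed for this sketch): four groups of 50 values each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col1": np.repeat(["W", "X", "Y", "Z"], 50),
    "col2": rng.normal(loc=10, scale=2, size=200),
})

# Compute bin edges once from the whole column so every group uses the same bins.
bin_edges = np.histogram_bin_edges(df["col2"], bins=10)

fig, ax = plt.subplots()
# Iterating over the grouped column yields (group name, values) pairs.
for name, values in df.groupby("col1")["col2"]:
    ax.hist(values, bins=bin_edges, alpha=0.5, label=name)

ax.legend(title="col1")
```

With this loop, the code does not need to change when the number of groups or their names change.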

Author

  • Chaitanya Betha

    I'm an undergrad student at IIT Madras interested in exploring new technologies. I have worked on various projects related to Data science, Machine learning & Neural Networks, including image classification using Convolutional Neural Networks, Stock prediction using Recurrent Neural Networks, and many more machine learning model training. I write blog articles in which I would try to provide a complete guide on a particular topic and try to cover as many different examples as possible with all the edge cases to understand the topic better and have a complete glance over the topic.