Pandas is a powerful data manipulation library in python. Among other manipulations, you can use it to compute cumulative sum of a row or a column in a dataset. In this tutorial, we’ll look at how to get the cumulative sum of a pandas dataframe column.
How to calculate cumulative sum in pandas?
You can use the pandas series cumsum()
function to calculate the cumulative sum of pandas column. The following is the syntax:
# cumulative sum of column 'Col1' df['Col1'].cumsum()
When applied on a pandas series, the cumsum()
function returns a pandas series of the cumulative sum of the original series values. You can also apply it to an entire dataframe, in which case it returns a dataframe with cumulative sum of all the numerical columns.
Examples
Let’s look at some examples of using the cumsum()
function to get the cumulative sum. First, we’ll create a sample dataframe that we’ll be using throughout this tutorial.
import pandas as pd # create dataframe df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130], 'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]}, index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\ '2020-03-05', '2020-03-06', '2020-03-07']) # print the dataframe print(df)
Output:
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 120 15 2020-03-03 180 12 2020-03-04 200 20 2020-03-05 240 30 2020-03-06 160 22 2020-03-07 130 14
The dataframe df
created above stores the daily pageviews and the ad-revenue of a blogging website.
1. Cumulative sum of a pandas series
Let’s apply the pandas cumsum()
function on a single column. For example, to get the cumulative ad-revenue generated by the website, we’ll apply the cumsum() function on the “Ad Revenue” column.
# cumulative Ad Reveunue df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum() # display the dataframe print(df)
Output:
Introductory ⭐
- Harvard University Data Science: Learn R Basics for Data Science
- Standford University Data Science: Introduction to Machine Learning
- UC Davis Data Science: Learn SQL Basics for Data Science
- IBM Data Science: Professional Certificate in Data Science
- IBM Data Analysis: Professional Certificate in Data Analytics
- Google Data Analysis: Professional Certificate in Data Analytics
- IBM Data Science: Professional Certificate in Python Data Science
- IBM Data Engineering Fundamentals: Python Basics for Data Science
Intermediate ⭐⭐⭐
- Harvard University Learning Python for Data Science: Introduction to Data Science with Python
- Harvard University Computer Science Courses: Using Python for Research
- IBM Python Data Science: Visualizing Data with Python
- DeepLearning.AI Data Science and Machine Learning: Deep Learning Specialization
Advanced ⭐⭐⭐⭐⭐
- UC San Diego Data Science: Python for Data Science
- UC San Diego Data Science: Probability and Statistics in Data Science using Python
- Google Data Analysis: Professional Certificate in Advanced Data Analytics
- MIT Statistics and Data Science: Machine Learning with Python - from Linear Models to Deep Learning
- MIT Statistics and Data Science: MicroMasters® Program in Statistics and Data Science
🔎 Find Data Science Programs 👨💻 111,889 already enrolled
Disclaimer: Data Science Parichay is reader supported. When you purchase a course through a link on this site, we may earn a small commission at no additional cost to you. Earned commissions help support this website and its team of writers.
PageViews Ad Revenue Cumulative Ad Revenue 2020-03-01 100 10 10 2020-03-02 120 15 25 2020-03-03 180 12 37 2020-03-04 200 20 57 2020-03-05 240 30 87 2020-03-06 160 22 109 2020-03-07 130 14 123
We created an additional column, “Cumulative Ad Revenue” in the dataframe df. Notice that the values in this column are sum of all preceding “Ad Revenue” values. For example, for the date “2020-03-03”, the cumulative ad revenue is 37, that is, 10+15+12. The cumulative sum here gives us an idea that how much revenue has the website generated so far.
What if there are NaN(s) present?
What do you think would happen if try to get the cumulative sum of a column with Nan value(s) using the cumsum()
function? Let’s find out. First, we will drop the “Cumulative Ad Revenue” column created above and then set one value in the “Ad Revenue” to Nan.
import numpy as np # drop Cumulative Ad Revenue df = df.drop('Cumulative Ad Revenue', axis=1) # set an Ad Revenue to NaN df.loc['2020-03-02', 'Ad Revenue'] = np.nan # display the dataframe print(df)
Output:
PageViews Ad Revenue 2020-03-01 100 10.0 2020-03-02 120 NaN 2020-03-03 180 12.0 2020-03-04 200 20.0 2020-03-05 240 30.0 2020-03-06 160 22.0 2020-03-07 130 14.0
You can see that the “Ad Revenue” corresponding to “2020-03-02” is NaN. Let’s go ahead and get the cumulative sum of this column.
# cumulative Ad Reveunue df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum() # display the dataframe print(df)
Output:
PageViews Ad Revenue Cumulative Ad Revenue 2020-03-01 100 10.0 10.0 2020-03-02 120 NaN NaN 2020-03-03 180 12.0 22.0 2020-03-04 200 20.0 42.0 2020-03-05 240 30.0 72.0 2020-03-06 160 22.0 94.0 2020-03-07 130 14.0 108.0
Notice that the “Cumulative Ad Revenue” is NaN where the “Ad Revenue” is NaN. The computation of cumulative sum is done without taking into account the NaN value. For example, for the date “2020-03-03”, the cumulative revenue is 22.0 which is 10.0+12.0
2. Cumulative sum of a pandas dataframe
You can also apply the cumsum()
function on an entire dataframe. For example, let’s start fresh and create the original dataframe df containing the pageviews and the daily ad revenue.
# create dataframe df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130], 'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]}, index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\ '2020-03-05', '2020-03-06', '2020-03-07']) # print the dataframe print(df)
Output:
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 120 15 2020-03-03 180 12 2020-03-04 200 20 2020-03-05 240 30 2020-03-06 160 22 2020-03-07 130 14
Now let’s apply the pandas cumsum()
function on the entire dataframe.
# cumulative sum of dataframe print(df.cumsum())
Output:
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 220 25 2020-03-03 400 37 2020-03-04 600 57 2020-03-05 840 87 2020-03-06 1000 109 2020-03-07 1130 123
The resulting dataframe has cumulative sum values for all the numerical columns in the original dataframe.
For more on the pandas series cumsum()
function, refer to its documentation.
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having numpy version 1.18.5 and pandas version 1.0.5
Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.