Pandas is a powerful data manipulation library in python. Among other manipulations, you can use it to compute cumulative sum of a row or a column in a dataset. In this tutorial, we’ll look at how to get the cumulative sum of a pandas dataframe column.
How to calculate cumulative sum in pandas?
You can use the pandas series cumsum()
function to calculate the cumulative sum of pandas column. The following is the syntax:
# cumulative sum of column 'Col1' df['Col1'].cumsum()
When applied on a pandas series, the cumsum()
function returns a pandas series of the cumulative sum of the original series values. You can also apply it to an entire dataframe, in which case it returns a dataframe with cumulative sum of all the numerical columns.
Examples
Let’s look at some examples of using the cumsum()
function to get the cumulative sum. First, we’ll create a sample dataframe that we’ll be using throughout this tutorial.
import pandas as pd # create dataframe df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130], 'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]}, index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\ '2020-03-05', '2020-03-06', '2020-03-07']) # print the dataframe print(df)
Output:
Highlighted programs for you
Flatiron School
Flatiron School
University of Maryland Global Campus
University of Maryland Global Campus
Creighton University
Creighton University
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 120 15 2020-03-03 180 12 2020-03-04 200 20 2020-03-05 240 30 2020-03-06 160 22 2020-03-07 130 14
The dataframe df
created above stores the daily pageviews and the ad-revenue of a blogging website.
1. Cumulative sum of a pandas series
Let’s apply the pandas cumsum()
function on a single column. For example, to get the cumulative ad-revenue generated by the website, we’ll apply the cumsum() function on the “Ad Revenue” column.
# cumulative Ad Reveunue df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum() # display the dataframe print(df)
Output:
PageViews Ad Revenue Cumulative Ad Revenue 2020-03-01 100 10 10 2020-03-02 120 15 25 2020-03-03 180 12 37 2020-03-04 200 20 57 2020-03-05 240 30 87 2020-03-06 160 22 109 2020-03-07 130 14 123
We created an additional column, “Cumulative Ad Revenue” in the dataframe df. Notice that the values in this column are sum of all preceding “Ad Revenue” values. For example, for the date “2020-03-03”, the cumulative ad revenue is 37, that is, 10+15+12. The cumulative sum here gives us an idea that how much revenue has the website generated so far.
What if there are NaN(s) present?
What do you think would happen if try to get the cumulative sum of a column with Nan value(s) using the cumsum()
function? Let’s find out. First, we will drop the “Cumulative Ad Revenue” column created above and then set one value in the “Ad Revenue” to Nan.
import numpy as np # drop Cumulative Ad Revenue df = df.drop('Cumulative Ad Revenue', axis=1) # set an Ad Revenue to NaN df.loc['2020-03-02', 'Ad Revenue'] = np.nan # display the dataframe print(df)
Output:
PageViews Ad Revenue 2020-03-01 100 10.0 2020-03-02 120 NaN 2020-03-03 180 12.0 2020-03-04 200 20.0 2020-03-05 240 30.0 2020-03-06 160 22.0 2020-03-07 130 14.0
You can see that the “Ad Revenue” corresponding to “2020-03-02” is NaN. Let’s go ahead and get the cumulative sum of this column.
# cumulative Ad Reveunue df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum() # display the dataframe print(df)
Output:
PageViews Ad Revenue Cumulative Ad Revenue 2020-03-01 100 10.0 10.0 2020-03-02 120 NaN NaN 2020-03-03 180 12.0 22.0 2020-03-04 200 20.0 42.0 2020-03-05 240 30.0 72.0 2020-03-06 160 22.0 94.0 2020-03-07 130 14.0 108.0
Notice that the “Cumulative Ad Revenue” is NaN where the “Ad Revenue” is NaN. The computation of cumulative sum is done without taking into account the NaN value. For example, for the date “2020-03-03”, the cumulative revenue is 22.0 which is 10.0+12.0
2. Cumulative sum of a pandas dataframe
You can also apply the cumsum()
function on an entire dataframe. For example, let’s start fresh and create the original dataframe df containing the pageviews and the daily ad revenue.
# create dataframe df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130], 'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]}, index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\ '2020-03-05', '2020-03-06', '2020-03-07']) # print the dataframe print(df)
Output:
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 120 15 2020-03-03 180 12 2020-03-04 200 20 2020-03-05 240 30 2020-03-06 160 22 2020-03-07 130 14
Now let’s apply the pandas cumsum()
function on the entire dataframe.
# cumulative sum of dataframe print(df.cumsum())
Output:
PageViews Ad Revenue 2020-03-01 100 10 2020-03-02 220 25 2020-03-03 400 37 2020-03-04 600 57 2020-03-05 840 87 2020-03-06 1000 109 2020-03-07 1130 123
The resulting dataframe has cumulative sum values for all the numerical columns in the original dataframe.
For more on the pandas series cumsum()
function, refer to its documentation.
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having numpy version 1.18.5 and pandas version 1.0.5
Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.