cumulative sum of a column in pandas

Cumulative Sum of Column in Pandas DataFrame

Pandas is a powerful data manipulation library in python. Among other manipulations, you can use it to compute cumulative sum of a row or a column in a dataset. In this tutorial, we’ll look at how to get the cumulative sum of a pandas dataframe column.

You can use the pandas series cumsum() function to calculate the cumulative sum of pandas column. The following is the syntax:

# cumulative sum of column 'Col1'
df['Col1'].cumsum()

When applied on a pandas series, the cumsum() function returns a pandas series of the cumulative sum of the original series values. You can also apply it to an entire dataframe, in which case it returns a dataframe with cumulative sum of all the numerical columns.

Let’s look at some examples of using the cumsum() function to get the cumulative sum. First, we’ll create a sample dataframe that we’ll be using throughout this tutorial.

import pandas as pd

# create dataframe
df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130],
                   'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]},
                  index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\
                           '2020-03-05', '2020-03-06', '2020-03-07'])
# print the dataframe
print(df)

Output:

            PageViews  Ad Revenue
2020-03-01        100          10
2020-03-02        120          15
2020-03-03        180          12
2020-03-04        200          20
2020-03-05        240          30
2020-03-06        160          22
2020-03-07        130          14

The dataframe df created above stores the daily pageviews and the ad-revenue of a blogging website.

Let’s apply the pandas cumsum() function on a single column. For example, to get the cumulative ad-revenue generated by the website, we’ll apply the cumsum() function on the “Ad Revenue” column.

# cumulative Ad Reveunue
df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum()
# display the dataframe
print(df)

Output:

📚 Data Science Programs By Skill Level

Introductory

Intermediate ⭐⭐⭐

Advanced ⭐⭐⭐⭐⭐

🔎 Find Data Science Programs 👨‍💻 111,889 already enrolled

Disclaimer: Data Science Parichay is reader supported. When you purchase a course through a link on this site, we may earn a small commission at no additional cost to you. Earned commissions help support this website and its team of writers.

            PageViews  Ad Revenue  Cumulative Ad Revenue
2020-03-01        100          10                     10
2020-03-02        120          15                     25
2020-03-03        180          12                     37
2020-03-04        200          20                     57
2020-03-05        240          30                     87
2020-03-06        160          22                    109
2020-03-07        130          14                    123
​

We created an additional column, “Cumulative Ad Revenue” in the dataframe df. Notice that the values in this column are sum of all preceding “Ad Revenue” values. For example, for the date “2020-03-03”, the cumulative ad revenue is 37, that is, 10+15+12. The cumulative sum here gives us an idea that how much revenue has the website generated so far.

What do you think would happen if try to get the cumulative sum of a column with Nan value(s) using the cumsum() function? Let’s find out. First, we will drop the “Cumulative Ad Revenue” column created above and then set one value in the “Ad Revenue” to Nan.

import numpy as np
# drop Cumulative Ad Revenue
df = df.drop('Cumulative Ad Revenue', axis=1)
# set an Ad Revenue to NaN
df.loc['2020-03-02', 'Ad Revenue'] = np.nan
# display the dataframe
print(df)

Output:

            PageViews  Ad Revenue
2020-03-01        100        10.0
2020-03-02        120         NaN
2020-03-03        180        12.0
2020-03-04        200        20.0
2020-03-05        240        30.0
2020-03-06        160        22.0
2020-03-07        130        14.0

You can see that the “Ad Revenue” corresponding to “2020-03-02” is NaN. Let’s go ahead and get the cumulative sum of this column.

# cumulative Ad Reveunue
df['Cumulative Ad Revenue'] = df['Ad Revenue'].cumsum()
# display the dataframe
print(df)

Output:

            PageViews  Ad Revenue  Cumulative Ad Revenue
2020-03-01        100        10.0                   10.0
2020-03-02        120         NaN                    NaN
2020-03-03        180        12.0                   22.0
2020-03-04        200        20.0                   42.0
2020-03-05        240        30.0                   72.0
2020-03-06        160        22.0                   94.0
2020-03-07        130        14.0                  108.0

Notice that the “Cumulative Ad Revenue” is NaN where the “Ad Revenue” is NaN. The computation of cumulative sum is done without taking into account the NaN value. For example, for the date “2020-03-03”, the cumulative revenue is 22.0 which is 10.0+12.0

You can also apply the cumsum() function on an entire dataframe. For example, let’s start fresh and create the original dataframe df containing the pageviews and the daily ad revenue.

# create dataframe
df = pd.DataFrame({'PageViews': [100, 120, 180, 200, 240, 160, 130],
                   'Ad Revenue': [10, 15, 12, 20, 30, 22, 14]},
                  index = ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',\
                           '2020-03-05', '2020-03-06', '2020-03-07'])
# print the dataframe
print(df)

Output:

            PageViews  Ad Revenue
2020-03-01        100          10
2020-03-02        120          15
2020-03-03        180          12
2020-03-04        200          20
2020-03-05        240          30
2020-03-06        160          22
2020-03-07        130          14

Now let’s apply the pandas cumsum() function on the entire dataframe.

# cumulative sum of dataframe
print(df.cumsum())

Output:

            PageViews  Ad Revenue
2020-03-01        100          10
2020-03-02        220          25
2020-03-03        400          37
2020-03-04        600          57
2020-03-05        840          87
2020-03-06       1000         109
2020-03-07       1130         123

The resulting dataframe has cumulative sum values for all the numerical columns in the original dataframe.

For more on the pandas series cumsum() function, refer to its documentation.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a python (version 3.8.3) kernel having numpy version 1.18.5 and pandas version 1.0.5


Subscribe to our newsletter for more informative guides and tutorials.
We do not spam and you can opt out any time.


Author

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

Scroll to Top