In this tutorial, we will see different aggregate functions in Pyspark and how to use them on dataframes with the help of examples.

## How to apply aggregate functions to Pyspark dataframes?

Aggregate functions are used to combine the data using descriptive statistics like count, average, min, max, etc. You can apply aggregate functions to Pyspark dataframes by using the specific aggregate function with the `select()` method or the `agg()` method. Let’s look at them in detail.


### Using the `select()` method

In this method, we use the `select()` method along with the required aggregate function (for example, `sum()`) to get the aggregate of a single column. The following is the syntax –

```python
dataframe.select(aggregate_function("column"))
```
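For instance, a minimal sketch of this pattern looks like the following. Note that the aggregate functions come from the `pyspark.sql.functions` module; the dataframe `df` and its "Price" column here are just hypothetical placeholders:

```python
# a minimal sketch, assuming a dataframe `df` with a numeric "Price" column
from pyspark.sql.functions import sum

# pass the aggregate function inside select() to get a single-row result
df.select(sum("Price")).show()
```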

### Using the `agg()` method

In this method, we use the `agg()` method along with the required aggregate function to get the aggregate of one or more columns. Here, we pass a dictionary to the `agg()` method with the column names as keys and the required aggregation functions as the values.

The following is the syntax –


```python
dataframe.agg({'column 1': 'aggregate_function', ....., 'column n': 'aggregate_function'})
```
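As a quick sketch of the dictionary form (again assuming a hypothetical dataframe `df` with "Price" and "Pages" columns):

```python
# a minimal sketch: each key is a column name, each value is the aggregation to apply to it
df.agg({'Price': 'sum', 'Pages': 'max'}).show()
```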

## Examples

Let’s look at some examples of using the above syntax to compute aggregations on a Pyspark dataframe. First, let’s create a Pyspark dataframe that we will be using throughout this tutorial.

```python
# import the pyspark module
import pyspark
# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from the SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "php", "sravan", 234, 45],
      [2, "sql", "chandra sekhar", 345, 67],
      [3, "python", "harsha", 1200, 100],
      [4, "R", "Rohith", 120, 45],
      [5, "hadoop", "manasa", 2340, 230],
]

# create a dataframe from the books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
```

Output:

```
+-------+---------+--------------+-----+-----+
|Book_Id|Book_Name|        Author|Price|Pages|
+-------+---------+--------------+-----+-----+
|      1|      php|        sravan|  234|   45|
|      2|      sql|chandra sekhar|  345|   67|
|      3|   python|        harsha| 1200|  100|
|      4|        R|        Rohith|  120|   45|
|      5|   hadoop|        manasa| 2340|  230|
+-------+---------+--------------+-----+-----+
```

We now have a dataframe containing details of some programming books.

Let’s now look at different examples of using some of the aggregation functions available in Pyspark, like the ones mentioned below (a combined example using several of them together follows the list) –

- `sum()` – Sum total value for the given column/s.
- `min()` – The minimum value for the given column/s.
- `max()` – The maximum value for the given column/s.
- `avg()` – Average value for the given column/s.
- `mean()` – Average value for the given column/s.
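These functions can also be combined in a single `select()` call to compute several aggregations in one pass – a short sketch, assuming the books dataframe created above:

```python
# compute multiple aggregations over the books dataframe at once
from pyspark.sql.functions import min, max, avg

dataframe.select(min("Price"), max("Price"), avg("Price")).show()
```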

### Sum of one or more columns

In this example, we get the total value with the `sum()` function using the two methods mentioned above.

```python
# import sum()
from pyspark.sql.functions import sum

# get the sum of the Price column
dataframe.select(sum('Price')).show()

# get the sum using the agg() method
dataframe.agg({'Pages': 'sum', 'Price': 'sum'}).show()
```

Output:

```
+----------+
|sum(Price)|
+----------+
|      4239|
+----------+

+----------+----------+
|sum(Pages)|sum(Price)|
+----------+----------+
|       487|      4239|
+----------+----------+
```

Here we compute the sum of values in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get the sum of values in this column. In the second method, we pass the “Price” and “Pages” columns to get their respective sums.


### Count in one or more columns

In this example, we get the count of values with the `count()` function using the two methods mentioned above.

```python
# import count()
from pyspark.sql.functions import count

# get the count of the Price column
dataframe.select(count('Price')).show()

# get the count using the agg() method
dataframe.agg({'Pages': 'count', 'Price': 'count'}).show()
```

Output:

```
+------------+
|count(Price)|
+------------+
|           5|
+------------+

+------------+------------+
|count(Pages)|count(Price)|
+------------+------------+
|           5|           5|
+------------+------------+
```

We get the count of values as 5 for the columns “Price” and “Pages”.

### Minimum in one or more columns

In this example, we get the minimum value with the `min()` function using the two methods mentioned above.

```python
# import min()
from pyspark.sql.functions import min

# get the minimum of the Price column
dataframe.select(min('Price')).show()

# get the minimum using the agg() method
dataframe.agg({'Pages': 'min', 'Price': 'min'}).show()
```

Output:

```
+----------+
|min(Price)|
+----------+
|       120|
+----------+

+----------+----------+
|min(Pages)|min(Price)|
+----------+----------+
|        45|       120|
+----------+----------+
```

Here we compute the minimum value in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get its min value. In the second method, we pass the “Price” and “Pages” columns to get their respective minimum values.

### Maximum in one or more columns

In this example, we get the maximum value with the `max()` function using the two methods mentioned above.

```python
# import max()
from pyspark.sql.functions import max

# get the maximum of the Price column
dataframe.select(max("Price")).show()

# get the maximum using the agg() method
dataframe.agg({'Pages': 'max', 'Price': 'max'}).show()
```

Output:

```
+----------+
|max(Price)|
+----------+
|      2340|
+----------+

+----------+----------+
|max(Pages)|max(Price)|
+----------+----------+
|       230|      2340|
+----------+----------+
```

Here we compute the maximum value in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get its max value. In the second method, we pass the “Price” and “Pages” columns to get their respective maximum values.

### Average in one or more columns

In this example, we get the average value in a column with the `avg()` and `mean()` functions using the two methods mentioned above.

```python
# import avg() and mean()
from pyspark.sql.functions import avg, mean

# get the average of the Price column using avg()
dataframe.select(avg('Price')).show()

# get the average using the agg() method with avg
dataframe.agg({'Pages': 'avg', 'Price': 'avg'}).show()

# get the average of the Price column using mean()
dataframe.select(mean('Price')).show()

# get the average using the agg() method with mean
dataframe.agg({'Pages': 'mean', 'Price': 'mean'}).show()
```

Output:

```
+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+

+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+
```

You can see that both the `avg()` and `mean()` aggregate functions give the same results.
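As an aside, if you prefer cleaner column names than the default `avg(Price)` style headers, you can alias the aggregate expression – a small sketch, assuming the same books dataframe:

```python
# rename the aggregate column with alias() for a friendlier header
from pyspark.sql.functions import avg

dataframe.select(avg("Price").alias("avg_price")).show()
```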

You might also be interested in –

- Pandas – Get Sum of one or more Columns
- Average for each row in Pandas Dataframe
- Display DataFrame in Pyspark with show()
