In this tutorial, we will see different aggregate functions in Pyspark and how to use them on dataframes with the help of examples.
How to apply aggregate functions to Pyspark dataframes?
Aggregate functions are used to summarize data using descriptive statistics like count, average, min, max, etc. You can apply aggregate functions to Pyspark dataframes by using the specific aggregate function with the select() method or the agg() method. Let’s look at them in detail.
Using select() method
In this method, we use the select() method along with the required aggregate function (for example, sum()) to get the aggregate of a single column. The following is the syntax –
dataframe.select(aggregate_function("column"))
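Note that the aggregate functions themselves come from the pyspark.sql.functions module and need to be imported before use. A minimal sketch (here, "column" is just a placeholder for an actual column name):

# aggregate functions live in pyspark.sql.functions
from pyspark.sql.functions import sum

# sum of a single column (the result is a one-row dataframe)
dataframe.select(sum("column")).show()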
Using agg() method
In this method, we use the agg() method along with the required aggregate function to get the aggregate of one or more columns. Here, we pass a dictionary to the agg() method with column names as keys and the required aggregate functions as values.
The following is the syntax –
dataframe.agg({'column 1': 'aggregate_function', ..., 'column n': 'aggregate_function'})
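Since dictionary keys must be unique, this form supports only one aggregation per column. If you need several aggregations on the same column, you can pass aggregate expressions to the agg() method directly instead. A short sketch (again, "column" is a placeholder):

# import the required aggregate functions
from pyspark.sql.functions import min, max

# multiple aggregations on the same column
dataframe.agg(min("column"), max("column")).show()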
Examples
Let’s look at some examples of using the above syntax to compute aggregations on a Pyspark dataframe. First, let’s create a Pyspark dataframe that we will be using throughout this tutorial.
# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "php", "sravan", 234, 45],
      [2, "sql", "chandra sekhar", 345, 67],
      [3, "python", "harsha", 1200, 100],
      [4, "R", "Rohith", 120, 45],
      [5, "hadoop", "manasa", 2340, 230]]

# create a dataframe from the books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
Output:
+-------+---------+--------------+-----+-----+
|Book_Id|Book_Name|        Author|Price|Pages|
+-------+---------+--------------+-----+-----+
|      1|      php|        sravan|  234|   45|
|      2|      sql|chandra sekhar|  345|   67|
|      3|   python|        harsha| 1200|  100|
|      4|        R|        Rohith|  120|   45|
|      5|   hadoop|        manasa| 2340|  230|
+-------+---------+--------------+-----+-----+
We now have a dataframe containing details of some programming books.
Let’s now look at different examples of using some of the aggregation functions available in Pyspark, like the ones mentioned below –

- sum() – the sum of values in the given column(s).
- min() – the minimum value in the given column(s).
- max() – the maximum value in the given column(s).
- avg() – the average of values in the given column(s).
- mean() – the average of values in the given column(s).
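By default, the resulting columns get auto-generated names like sum(Price). If you want a friendlier name, you can chain alias() onto the aggregate expression. A small sketch using the dataframe created above (the name total_price is just an example):

# import sum()
from pyspark.sql.functions import sum

# give the aggregate column an explicit name
dataframe.select(sum('Price').alias('total_price')).show()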
Sum of one or more columns
In this example, we get the total value with the sum() function using the two methods mentioned above.
# import sum()
from pyspark.sql.functions import sum

# get the sum of the Price column
dataframe.select(sum('Price')).show()

# get the sum using the agg() method
dataframe.agg({'Pages': 'sum', 'Price': 'sum'}).show()
Output:
+----------+
|sum(Price)|
+----------+
|      4239|
+----------+

+----------+----------+
|sum(Pages)|sum(Price)|
+----------+----------+
|       487|      4239|
+----------+----------+
Here we compute the sum of values in a column using the select() method and the agg() method. In the first method, we pass the “Price” column to get the sum of values in this column. In the second method, we pass the “Price” and “Pages” columns to get their respective sums.
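Note that both methods return a dataframe, not a plain number. If you need the aggregate as a Python value, you can pull it out of the first (and only) row of the result. A quick sketch, using the sum() imported above:

# the aggregate result is a one-row dataframe
total = dataframe.select(sum('Price')).first()[0]
print(total)
# Output: 4239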
Count in one or more columns
In this example, we get the count of values with the count() function using the two methods mentioned above.
# import count()
from pyspark.sql.functions import count

# get the count of the Price column
dataframe.select(count('Price')).show()

# get the count using the agg() method
dataframe.agg({'Pages': 'count', 'Price': 'count'}).show()
Output:
+------------+
|count(Price)|
+------------+
|           5|
+------------+

+------------+------------+
|count(Pages)|count(Price)|
+------------+------------+
|           5|           5|
+------------+------------+
We get the count of values as 5 for the columns “Price” and “Pages”.
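Keep in mind that count() only counts the non-null values in a column. If you instead want the number of distinct values, Pyspark also provides the countDistinct() function. A short sketch using the same dataframe:

# import countDistinct()
from pyspark.sql.functions import countDistinct

# number of distinct values in the Pages column
dataframe.select(countDistinct('Pages')).show()

Since the value 45 appears twice in the “Pages” column, the distinct count here would be 4 rather than 5.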
Minimum in one or more columns
In this example, we get the minimum value with the min() function using the two methods mentioned above.
# import min()
from pyspark.sql.functions import min

# get the minimum of the Price column
dataframe.select(min('Price')).show()

# get the minimum using the agg() method
dataframe.agg({'Pages': 'min', 'Price': 'min'}).show()
Output:
+----------+
|min(Price)|
+----------+
|       120|
+----------+

+----------+----------+
|min(Pages)|min(Price)|
+----------+----------+
|        45|       120|
+----------+----------+
Here we compute the minimum value in a column using the select() method and the agg() method. In the first method, we pass the “Price” column to get its min value. In the second method, we pass the “Price” and “Pages” columns to get their respective minimum values.
Maximum in one or more columns
In this example, we get the maximum value with the max() function using the two methods mentioned above.
# import max()
from pyspark.sql.functions import max

# get the maximum of the Price column
dataframe.select(max("Price")).show()

# get the maximum using the agg() method
dataframe.agg({'Pages': 'max', 'Price': 'max'}).show()
Output:
+----------+
|max(Price)|
+----------+
|      2340|
+----------+

+----------+----------+
|max(Pages)|max(Price)|
+----------+----------+
|       230|      2340|
+----------+----------+
Here we compute the maximum value in a column using the select() method and the agg() method. In the first method, we pass the “Price” column to get its max value. In the second method, we pass the “Price” and “Pages” columns to get their respective maximum values.
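Since select() accepts multiple aggregate expressions, you can also compute the minimum and maximum together in a single pass. A small sketch using the same dataframe:

# import min() and max()
from pyspark.sql.functions import min, max

# get the min and max of the Price column in one go
dataframe.select(min('Price'), max('Price')).show()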
Average in one or more columns
In this example, we get the average value in a column with the avg() and mean() functions using the two methods mentioned above.
# import avg() and mean()
from pyspark.sql.functions import avg, mean

# get the average of the Price column using avg()
dataframe.select(avg('Price')).show()

# get the average using the agg() method with avg()
dataframe.agg({'Pages': 'avg', 'Price': 'avg'}).show()

# get the average of the Price column using mean()
dataframe.select(mean('Price')).show()

# get the average using the agg() method with mean()
dataframe.agg({'Pages': 'mean', 'Price': 'mean'}).show()
Output:
+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+

+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+
You can see that both the avg() and mean() aggregate functions give the same results. In fact, mean() is an alias for avg() in Pyspark, which is why the output column is labeled avg(Price) in both cases.
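If you want to round off the average for display, you can wrap the aggregate in the round() function from pyspark.sql.functions. A quick sketch using the dataframe above:

# import avg() and round()
from pyspark.sql.functions import avg, round

# average price rounded to one decimal place
dataframe.select(round(avg('Price'), 1)).show()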
You might also be interested in –
- Pandas – Get Sum of one or more Columns
- Average for each row in Pandas Dataframe
- Display DataFrame in Pyspark with show()