In this tutorial, we will see different aggregate functions in Pyspark and how to use them on dataframes with the help of examples.

## How to apply aggregate functions to Pyspark dataframes?

Aggregate functions are used to combine the data using descriptive statistics like count, average, min, max, etc. You can apply aggregate functions to Pyspark dataframes by using the specific aggregate function with the `select()` method or the `agg()` method. Let’s look at them in detail.


### Using the `select()` method

In this method, we use the `select()` method along with the required aggregate function (for example, `sum()`) to get the aggregate of a single column. The following is the syntax –

```python
dataframe.select(aggregate_function("column"))
```
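For instance, a minimal sketch of this pattern looks like the following. Note that the aggregate functions come from the `pyspark.sql.functions` module; the dataframe `df` and its "Price" column here are just hypothetical placeholders:

```python
# a minimal sketch, assuming a dataframe `df` with a numeric "Price" column
from pyspark.sql.functions import sum

# pass the aggregate function inside select() to get a single-row result
df.select(sum("Price")).show()
```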

### Using the `agg()` method

In this method, we use the `agg()` method along with the required aggregate function to get the aggregate of one or more columns. Here, we pass a dictionary to the `agg()` method with the column names as keys and the required aggregation functions as the values.

The following is the syntax –


```python
dataframe.agg({'column 1': 'aggregate_function', ....., 'column n': 'aggregate_function'})
```
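As a quick sketch of the dictionary form (again assuming a hypothetical dataframe `df` with "Price" and "Pages" columns):

```python
# a minimal sketch: each key is a column name, each value is the aggregation to apply to it
df.agg({'Price': 'sum', 'Pages': 'max'}).show()
```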

## Examples

Let’s look at some examples of using the above syntax to compute aggregations on a Pyspark dataframe. First, let’s create a Pyspark dataframe that we will be using throughout this tutorial.

```python
# import the pyspark module
import pyspark
# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from the SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "php", "sravan", 234, 45],
      [2, "sql", "chandra sekhar", 345, 67],
      [3, "python", "harsha", 1200, 100],
      [4, "R", "Rohith", 120, 45],
      [5, "hadoop", "manasa", 2340, 230],
]

# create a dataframe from the books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
```

Output:

```
+-------+---------+--------------+-----+-----+
|Book_Id|Book_Name|        Author|Price|Pages|
+-------+---------+--------------+-----+-----+
|      1|      php|        sravan|  234|   45|
|      2|      sql|chandra sekhar|  345|   67|
|      3|   python|        harsha| 1200|  100|
|      4|        R|        Rohith|  120|   45|
|      5|   hadoop|        manasa| 2340|  230|
+-------+---------+--------------+-----+-----+
```

We now have a dataframe containing details of some programming books.

Let’s now look at different examples of using some of the aggregation functions available in Pyspark, like the ones mentioned below (a combined example using several of them together follows the list) –

- `sum()` – Sum total value for the given column/s.
- `min()` – The minimum value for the given column/s.
- `max()` – The maximum value for the given column/s.
- `avg()` – Average value for the given column/s.
- `mean()` – Average value for the given column/s.
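These functions can also be combined in a single `select()` call to compute several aggregations in one pass – a short sketch, assuming the books dataframe created above:

```python
# compute multiple aggregations over the books dataframe at once
from pyspark.sql.functions import min, max, avg

dataframe.select(min("Price"), max("Price"), avg("Price")).show()
```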

### Sum of one or more columns

In this example, we get the total value with the `sum()` function using the two methods mentioned above.

```python
# import sum()
from pyspark.sql.functions import sum

# get the sum of the Price column
dataframe.select(sum('Price')).show()

# get the sum using the agg() method
dataframe.agg({'Pages': 'sum', 'Price': 'sum'}).show()
```

Output:

```
+----------+
|sum(Price)|
+----------+
|      4239|
+----------+

+----------+----------+
|sum(Pages)|sum(Price)|
+----------+----------+
|       487|      4239|
+----------+----------+
```

Here we compute the sum of values in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get the sum of values in this column. In the second method, we pass the “Price” and “Pages” columns to get their respective sums.


### Count in one or more columns

In this example, we get the count of values with the `count()` function using the two methods mentioned above.

```python
# import count()
from pyspark.sql.functions import count

# get the count of the Price column
dataframe.select(count('Price')).show()

# get the count using the agg() method
dataframe.agg({'Pages': 'count', 'Price': 'count'}).show()
```

Output:

```
+------------+
|count(Price)|
+------------+
|           5|
+------------+

+------------+------------+
|count(Pages)|count(Price)|
+------------+------------+
|           5|           5|
+------------+------------+
```

We get the count of values as 5 for the columns “Price” and “Pages”.

### Minimum in one or more columns

In this example, we get the minimum value with the `min()` function using the two methods mentioned above.

```python
# import min()
from pyspark.sql.functions import min

# get the minimum of the Price column
dataframe.select(min('Price')).show()

# get the minimum using the agg() method
dataframe.agg({'Pages': 'min', 'Price': 'min'}).show()
```

Output:

```
+----------+
|min(Price)|
+----------+
|       120|
+----------+

+----------+----------+
|min(Pages)|min(Price)|
+----------+----------+
|        45|       120|
+----------+----------+
```

Here we compute the minimum value in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get its min value. In the second method, we pass the “Price” and “Pages” columns to get their respective minimum values.

### Maximum in one or more columns

In this example, we get the maximum value with the `max()` function using the two methods mentioned above.

```python
# import max()
from pyspark.sql.functions import max

# get the maximum of the Price column
dataframe.select(max("Price")).show()

# get the maximum using the agg() method
dataframe.agg({'Pages': 'max', 'Price': 'max'}).show()
```

Output:

```
+----------+
|max(Price)|
+----------+
|      2340|
+----------+

+----------+----------+
|max(Pages)|max(Price)|
+----------+----------+
|       230|      2340|
+----------+----------+
```

Here we compute the maximum value in a column using the `select()` method and the `agg()` method. In the first method, we pass the “Price” column to get its max value. In the second method, we pass the “Price” and “Pages” columns to get their respective maximum values.

### Average in one or more columns

In this example, we get the average value in a column with the `avg()` and `mean()` functions using the two methods mentioned above.

```python
# import avg() and mean()
from pyspark.sql.functions import avg, mean

# get the average of the Price column using avg()
dataframe.select(avg('Price')).show()

# get the average using the agg() method with avg
dataframe.agg({'Pages': 'avg', 'Price': 'avg'}).show()

# get the average of the Price column using mean()
dataframe.select(mean('Price')).show()

# get the average using the agg() method with mean
dataframe.agg({'Pages': 'mean', 'Price': 'mean'}).show()
```

Output:

```
+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+

+----------+
|avg(Price)|
+----------+
|     847.8|
+----------+

+----------+----------+
|avg(Pages)|avg(Price)|
+----------+----------+
|      97.4|     847.8|
+----------+----------+
```

You can see that both the `avg()` and `mean()` aggregate functions give the same results.
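As an aside, if you prefer cleaner column names than the default `avg(Price)` style headers, you can alias the aggregate expression – a small sketch, assuming the same books dataframe:

```python
# rename the aggregate column with alias() for a friendlier header
from pyspark.sql.functions import avg

dataframe.select(avg("Price").alias("avg_price")).show()
```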

You might also be interested in –

- Pandas – Get Sum of one or more Columns
- Average for each row in Pandas Dataframe
- Display DataFrame in Pyspark with show()
