PySpark – Variance of a DataFrame Column

In this tutorial, we will look at how to get the variance of a column in a PySpark dataframe with the help of some examples.

How to get the variance of a PySpark dataframe column?

You can use the variance() function from the pyspark.sql.functions module to compute the variance of a PySpark dataframe column. The following is the syntax –

variance("column_name")

Pass the column name as a parameter to the variance() function.

You can similarly use the var_samp() function to get the sample variance and the var_pop() function to get the population variance. Both functions are available in the same pyspark.sql.functions module.
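For example, assuming you already have a PySpark dataframe with a numeric column (both df and "column_name" below are placeholders), a minimal sketch of using these functions looks like this –

from pyspark.sql.functions import variance, var_samp, var_pop

# df and "column_name" are placeholders for your own dataframe and column
df.select(variance("column_name"), var_samp("column_name"), var_pop("column_name")).show()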

Examples

Let’s look at some examples of computing the variance for column(s) in a PySpark dataframe. First, let’s create a sample PySpark dataframe that we will be using throughout this tutorial.

# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create a SparkSession with an app name
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
data = [[1, "PHP", "Sravan", 250, 454],
        [2, "SQL", "Chandra", 300, 320],
        [3, "Python", "Harsha", 250, 500],
        [4, "R", "Rohith", 1200, 310],
        [5, "Hadoop", "Manasa", 700, 270],
        ]

# create a dataframe from the books data
dataframe = spark.createDataFrame(data, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()

Output:

+-------+---------+-------+-----+-----+
|Book_Id|Book_Name| Author|Price|Pages|
+-------+---------+-------+-----+-----+
|      1|      PHP| Sravan|  250|  454|
|      2|      SQL|Chandra|  300|  320|
|      3|   Python| Harsha|  250|  500|
|      4|        R| Rohith| 1200|  310|
|      5|   Hadoop| Manasa|  700|  270|
+-------+---------+-------+-----+-----+

We have a dataframe containing information on books – their author names, prices, pages, etc.

Variance of a single column

Let’s compute the variance for the “Price” column in the dataframe. To do so, you can use the variance() function in combination with the PySpark select() function.


from pyspark.sql.functions import variance

# variance of the Price column
dataframe.select(variance("Price")).show()

Output:

+---------------+
|var_samp(Price)|
+---------------+
|       171750.0|
+---------------+

We get the variance for the “Price” column. Note that the variance() function returns the sample variance – it is an alias for var_samp(), which is why the resulting column is named var_samp(Price).
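If you need the variance as a plain Python value rather than a displayed table, one common approach is to collect() the one-row result and index into it, for example –

# collect the single-row result and extract the value
# (first row, first column)
price_var = dataframe.select(variance("Price")).collect()[0][0]
print(price_var)
# 171750.0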

Alternatively, you can use the PySpark agg() function to compute the variance of a column.

# variance of the Price column
dataframe.agg({'Price': 'variance'}).show()

Output:

+---------------+
|variance(Price)|
+---------------+
|       171750.0|
+---------------+

We get the same result as above.
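If you’d like a more readable name for the resulting column, you can chain alias() on the variance() expression. A small sketch (the name price_variance below is just illustrative) –

from pyspark.sql.functions import variance

# rename the output column with alias()
dataframe.select(variance("Price").alias("price_variance")).show()

Output:

+--------------+
|price_variance|
+--------------+
|      171750.0|
+--------------+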

Let’s now use the var_samp() and var_pop() functions on the same column along with the variance() function to compare their results.

from pyspark.sql.functions import variance, var_samp, var_pop

# variance of the Price column
dataframe.select(variance("Price"), var_samp("Price"), var_pop("Price")).show()

Output:

+---------------+---------------+--------------+
|var_samp(Price)|var_samp(Price)|var_pop(Price)|
+---------------+---------------+--------------+
|       171750.0|       171750.0|      137400.0|
+---------------+---------------+--------------+

You can see that variance() and var_samp() give the same result, which is the sample variance, whereas var_pop() gives the population variance.
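As a quick sanity check, you can reproduce both numbers outside Spark with Python’s built-in statistics module on the raw price values – the sample variance divides the sum of squared deviations by n-1, while the population variance divides by n.

import statistics

# the values in the "Price" column
prices = [250, 300, 250, 1200, 700]

print(statistics.variance(prices))   # sample variance: 171750
print(statistics.pvariance(prices))  # population variance: 137400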

Variance for more than one column

You can get the variance for more than one column as well. Inside the select() function, use a separate variance() function for each column you want to compute the variance for.

Let’s compute the variance for the “Price” and the “Pages” columns.

from pyspark.sql.functions import variance

# variance of the Price and Pages columns
dataframe.select(variance("Price"), variance("Pages")).show()

Output:

+---------------+---------------+
|var_samp(Price)|var_samp(Pages)|
+---------------+---------------+
|       171750.0|        10013.2|
+---------------+---------------+

We get the desired output.
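If you have several numeric columns, you can also build the list of variance() expressions programmatically, for example with a list comprehension (the variable name num_cols below is just illustrative) –

from pyspark.sql.functions import variance

# columns to compute the variance for
num_cols = ["Price", "Pages"]

# one variance() expression per column
dataframe.select([variance(c) for c in num_cols]).show()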

You can also use the agg() function to compute the variance of multiple columns.

# variance of the Price and Pages columns
dataframe.agg({'Price': 'variance', 'Pages': 'variance'}).show()

Output:

+---------------+---------------+
|variance(Pages)|variance(Price)|
+---------------+---------------+
|        10013.2|       171750.0|
+---------------+---------------+

We get the same results as above (note that the column order may differ when passing a dictionary to agg()).
