In this tutorial, we will look at how to get the variance of a column in a PySpark dataframe with the help of some examples.

## How to get the variance of a PySpark dataframe column?

You can use the `variance()` function from the `pyspark.sql.functions` module to compute the variance of a PySpark column. The following is the syntax:

```python
variance("column_name")
```

Pass the column name as a parameter to the `variance()` function.

You can similarly use the `var_samp()` function to get the sample variance and the `var_pop()` function to get the population variance. Both functions are available in the same `pyspark.sql.functions` module.
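For reference, the two quantities differ only in the divisor. Using the standard textbook definitions, for $n$ values $x_1, \dots, x_n$ with mean $\bar{x}$, the sample variance $s^2$ and the population variance $\sigma^2$ can be written as follows (the symbols are notation chosen here, not taken from the PySpark documentation):

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
$$

Since both share the same sum of squared deviations, the sample variance is always the larger of the two whenever that sum is non-zero.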

## Examples

Let’s look at some examples of computing the variance for one or more columns in a PySpark dataframe. First, let’s create a sample PySpark dataframe that we will be using throughout this tutorial.

```python
# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from the SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as a list of lists
df = [[1, "PHP", "Sravan", 250, 454],
      [2, "SQL", "Chandra", 300, 320],
      [3, "Python", "Harsha", 250, 500],
      [4, "R", "Rohith", 1200, 310],
      [5, "Hadoop", "Manasa", 700, 270],
      ]

# create a dataframe from the books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
```

Output:

```
+-------+---------+-------+-----+-----+
|Book_Id|Book_Name| Author|Price|Pages|
+-------+---------+-------+-----+-----+
|      1|      PHP| Sravan|  250|  454|
|      2|      SQL|Chandra|  300|  320|
|      3|   Python| Harsha|  250|  500|
|      4|        R| Rohith| 1200|  310|
|      5|   Hadoop| Manasa|  700|  270|
+-------+---------+-------+-----+-----+
```

We have a dataframe containing information on books like their author names, prices, pages, etc.

### Variance of a single column

Let’s compute the variance for the “Price” column in the dataframe. To do so, you can use the `variance()` function in combination with the PySpark `select()` function.

```python
from pyspark.sql.functions import variance

# variance of the Price column
dataframe.select(variance("Price")).show()
```

Output:

```
+---------------+
|var_samp(Price)|
+---------------+
|       171750.0|
+---------------+
```

We get the variance for the “Price” column. Note that the `variance()` function gives the sample variance.

Alternatively, you can use the PySpark `agg()` function to compute the variance for a column.

```python
# variance of the Price column
dataframe.agg({'Price': 'variance'}).show()
```

Output:

```
+---------------+
|variance(Price)|
+---------------+
|       171750.0|
+---------------+
```

We get the same result as above.

Let’s now use the `var_samp()` and `var_pop()` functions on the same column along with the `variance()` function to compare their results.

```python
from pyspark.sql.functions import variance, var_samp, var_pop

# sample and population variance of the Price column
dataframe.select(variance("Price"), var_samp("Price"), var_pop("Price")).show()
```

Output:

```
+---------------+---------------+--------------+
|var_samp(Price)|var_samp(Price)|var_pop(Price)|
+---------------+---------------+--------------+
|       171750.0|       171750.0|      137400.0|
+---------------+---------------+--------------+
```

You can see that `variance()` and `var_samp()` give the same result, which is the sample variance, whereas the `var_pop()` function gives the population variance.
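To see where these two numbers come from, the arithmetic can be reproduced in plain Python without Spark. This is only a sanity-check sketch: the `prices` list is copied by hand from the “Price” column of the sample data, and the standard library `statistics` module is used as an independent cross-check.

```python
import statistics

# Values copied by hand from the "Price" column of the sample dataframe
prices = [250, 300, 250, 1200, 700]

n = len(prices)
mean = sum(prices) / n                               # 540.0
squared_devs = sum((p - mean) ** 2 for p in prices)  # 687000.0

sample_var = squared_devs / (n - 1)   # divide by n - 1
population_var = squared_devs / n     # divide by n

print(sample_var, population_var)     # 171750.0 137400.0

# Cross-check with the standard library
print(statistics.variance(prices), statistics.pvariance(prices))
```

The only difference between the two results is the divisor: 4 for the sample variance and 5 for the population variance.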

### Variance for more than one column

You can get the variance for more than one column as well. Inside the `select()` function, use a separate `variance()` function for each column you want to compute the variance for.

Let’s compute the variance for the “Price” and the “Pages” columns.

```python
from pyspark.sql.functions import variance

# variance of the Price and Pages columns
dataframe.select(variance("Price"), variance("Pages")).show()
```

Output:

```
+---------------+---------------+
|var_samp(Price)|var_samp(Pages)|
+---------------+---------------+
|       171750.0|        10013.2|
+---------------+---------------+
```

We get the desired output.

You can also use the `agg()` function to compute the variance of multiple columns.

```python
# variance of the Price and Pages columns
dataframe.agg({'Price': 'variance', 'Pages': 'variance'}).show()
```

Output:

```
+---------------+---------------+
|variance(Pages)|variance(Price)|
+---------------+---------------+
|        10013.2|       171750.0|
+---------------+---------------+
```

We get the same result as above.

