Standard deviation is a descriptive statistic that measures the spread of the data. In this tutorial, we will look at how to get the standard deviation of a column in a PySpark dataframe with the help of some examples.

## How to get the standard deviation of a PySpark dataframe column?

You can use the `stddev()` function from the `pyspark.sql.functions` module to compute the standard deviation of a PySpark column. The following is the syntax:

    stddev("column_name")

Pass the column name as a parameter to the `stddev()` function.

You can similarly use the `stddev_samp()` function to get the sample standard deviation and the `stddev_pop()` function to get the population standard deviation. Both functions are available in the same `pyspark.sql.functions` module.
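The only difference between the two is the divisor: the sample version divides the sum of squared deviations by n - 1, while the population version divides by n. The following is a minimal pure-Python sketch of that difference, with no Spark involved. The helper names `sample_std` and `population_std` and the input values are made up for illustration and are not part of PySpark.

```python
import math

def population_std(values):
    """Population standard deviation: divide by n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def sample_std(values):
    """Sample standard deviation: divide by n - 1 (needs at least two values)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation needs at least two values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# hypothetical data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std(data))  # divisor is n = 8
print(sample_std(data))      # divisor is n - 1 = 7, so the result is slightly larger
```

For the same data, the sample value is always a little larger than the population value, and the gap shrinks as the number of rows grows.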

## Examples

Let’s look at some examples of computing standard deviation for column(s) in a PySpark dataframe. First, let’s create a sample PySpark dataframe that we will be using throughout this tutorial.

    # import the pyspark module
    import pyspark
    # import the SparkSession class from pyspark.sql
    from pyspark.sql import SparkSession

    # create an app from the SparkSession class
    spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

    # books data as list of lists
    df = [[1, "PHP", "Sravan", 250, 454],
          [2, "SQL", "Chandra", 300, 320],
          [3, "Python", "Harsha", 250, 500],
          [4, "R", "Rohith", 1200, 310],
          [5, "Hadoop", "Manasa", 700, 270],
          ]

    # creating dataframe from books data
    dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

    # display the dataframe
    dataframe.show()

Output:

    +-------+---------+-------+-----+-----+
    |Book_Id|Book_Name| Author|Price|Pages|
    +-------+---------+-------+-----+-----+
    |      1|      PHP| Sravan|  250|  454|
    |      2|      SQL|Chandra|  300|  320|
    |      3|   Python| Harsha|  250|  500|
    |      4|        R| Rohith| 1200|  310|
    |      5|   Hadoop| Manasa|  700|  270|
    +-------+---------+-------+-----+-----+

We have a dataframe containing information on books like their author names, prices, pages, etc.

### Standard deviation of a single column

Let’s compute the standard deviation for the “Price” column in the dataframe. To do so, you can use the `stddev()` function in combination with the PySpark `select()` function.

    from pyspark.sql.functions import stddev

    # standard deviation of the Price column
    dataframe.select(stddev("Price")).show()

Output:

    +------------------+
    |stddev_samp(Price)|
    +------------------+
    |  414.427315702042|
    +------------------+

We get the standard deviation for the “Price” column. Note that the `stddev()` function gives the sample standard deviation.

Alternatively, you can use the PySpark `agg()` function to compute the standard deviation for a column.

    # standard deviation of the Price column
    dataframe.agg({'Price': 'stddev'}).show()

Output:

    +----------------+
    |   stddev(Price)|
    +----------------+
    |414.427315702042|
    +----------------+

We get the same result as above.

Let’s now use the `stddev_samp()` and `stddev_pop()` functions on the same column along with the `stddev()` function to compare their results.

    from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

    # sample and population standard deviation of the Price column
    dataframe.select(stddev("Price"), stddev_samp("Price"), stddev_pop("Price")).show()

Output:

    +------------------+------------------+-----------------+
    |stddev_samp(Price)|stddev_samp(Price)|stddev_pop(Price)|
    +------------------+------------------+-----------------+
    |  414.427315702042|  414.427315702042|370.6750598570128|
    +------------------+------------------+-----------------+

You can see that `stddev()` and `stddev_samp()` give the same result, which is the sample standard deviation (the output even labels the `stddev()` column as `stddev_samp(Price)`), whereas the `stddev_pop()` function gives the population standard deviation.
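As a quick sanity check outside Spark, both figures can be reproduced with Python's built-in `statistics` module, using the five values of the “Price” column from the sample dataframe. This is only a cross-check sketch, and the variable names are illustrative.

```python
import math
import statistics

# values of the Price column from the sample dataframe
prices = [250, 300, 250, 1200, 700]

sample_sd = statistics.stdev(prices)       # divides by n - 1, like stddev_samp()
population_sd = statistics.pstdev(prices)  # divides by n, like stddev_pop()

print(sample_sd)
print(population_sd)

# the two values differ by a factor of sqrt((n - 1) / n)
n = len(prices)
related = math.isclose(population_sd, sample_sd * math.sqrt((n - 1) / n))
print(related)
```

The printed values should agree with the Spark output to within floating-point rounding, roughly 414.43 for the sample version and 370.68 for the population version.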

### Standard deviation for more than one column

You can get the standard deviation for more than one column as well. Inside the `select()` function, use a separate `stddev()` call for each column whose standard deviation you want to compute.

Let’s compute the standard deviation for the “Price” and the “Pages” columns.

    from pyspark.sql.functions import stddev

    # standard deviation of the Price and Pages columns
    dataframe.select(stddev("Price"), stddev("Pages")).show()

Output:

    +------------------+------------------+
    |stddev_samp(Price)|stddev_samp(Pages)|
    +------------------+------------------+
    |  414.427315702042|100.06597823436296|
    +------------------+------------------+

We get the desired output.

You can also use the `agg()` function to compute the standard deviation of multiple columns.

    # standard deviation of the Price and Pages columns
    dataframe.agg({'Price': 'stddev', 'Pages': 'stddev'}).show()

Output:

    +------------------+----------------+
    |     stddev(Pages)|   stddev(Price)|
    +------------------+----------------+
    |100.06597823436296|414.427315702042|
    +------------------+----------------+

We get the same result as above.
