# Pyspark – Standard Deviation of a Column

Standard deviation is a descriptive statistic used as a measure of the spread in the data. In this tutorial, we will look at how to get the standard deviation of a column in a Pyspark dataframe with the help of some examples.

## How to get standard deviation for a Pyspark dataframe column?

You can use the `stddev()` function from the `pyspark.sql.functions` module to compute the standard deviation of a Pyspark column. The following is the syntax –

`stddev("column_name")`

Pass the column name as a parameter to the `stddev()` function.

You can similarly use the `stddev_samp()` function to get the sample standard deviation and the `stddev_pop()` function to get the population standard deviation. Both the functions are available in the same `pyspark.sql.functions` module.
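To illustrate the difference between the two (using Python's built-in `statistics` module rather than Spark, just to show the underlying formulas), the sample standard deviation divides the squared deviations by n−1 while the population standard deviation divides by n:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# sample standard deviation (divides by n-1), analogous to stddev()/stddev_samp()
print(statistics.stdev(data))   # ~2.1381

# population standard deviation (divides by n), analogous to stddev_pop()
print(statistics.pstdev(data))  # 2.0
```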

## Examples

Let’s look at some examples of computing standard deviation for column(s) in a Pyspark dataframe. First, let’s create a sample Pyspark dataframe that we will be using throughout this tutorial.

```
# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "PHP", "Sravan", 250, 454],
      [2, "SQL", "Chandra", 300, 320],
      [3, "Python", "Harsha", 250, 500],
      [4, "R", "Rohith", 1200, 310],
      [5, "Hadoop", "Manasa", 700, 270],
]

# creating dataframe from books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
```

Output:

```
+-------+---------+-------+-----+-----+
|Book_Id|Book_Name| Author|Price|Pages|
+-------+---------+-------+-----+-----+
|      1|      PHP| Sravan|  250|  454|
|      2|      SQL|Chandra|  300|  320|
|      3|   Python| Harsha|  250|  500|
|      4|        R| Rohith| 1200|  310|
|      5|   Hadoop| Manasa|  700|  270|
+-------+---------+-------+-----+-----+
```

We have a dataframe containing information on books like their author names, prices, pages, etc.

### Standard deviation of a single column

Let’s compute the standard deviation for the “Price” column in the dataframe. To do so, you can use the `stddev()` function in combination with the Pyspark `select()` function.


```
from pyspark.sql.functions import stddev

# standard deviation of the Price column
dataframe.select(stddev("Price")).show()
```

Output:

```
+------------------+
|stddev_samp(Price)|
+------------------+
|  414.427315702042|
+------------------+
```

We get the standard deviation for the “Price” column. Note that the `stddev()` function gives the sample standard deviation.

Alternatively, you can use the Pyspark `agg()` function to compute the standard deviation for a column.

```
# standard deviation of the Price column
dataframe.agg({'Price': 'stddev'}).show()
```

Output:

```
+----------------+
|   stddev(Price)|
+----------------+
|414.427315702042|
+----------------+
```

We get the same result as above.
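As a quick sanity check outside Spark (a plain-Python computation with Python's built-in `statistics` module, assuming the five Price values from the dataframe above), the sample standard deviation works out to the same figure:

```python
import statistics

# the five values of the Price column
prices = [250, 300, 250, 1200, 700]

# sample standard deviation, matching stddev()/stddev_samp() in Spark
print(statistics.stdev(prices))  # ~414.4273, same as the Spark output
```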

Let’s now use the `stddev_samp()` and `stddev_pop()` functions on the same column along with the `stddev()` function to compare their results.

```
from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

# standard deviation of the Price column
dataframe.select(stddev("Price"), stddev_samp("Price"), stddev_pop("Price")).show()
```

Output:

```
+------------------+------------------+-----------------+
|stddev_samp(Price)|stddev_samp(Price)|stddev_pop(Price)|
+------------------+------------------+-----------------+
|  414.427315702042|  414.427315702042|370.6750598570128|
+------------------+------------------+-----------------+
```

You can see that `stddev()` and `stddev_samp()` give the same result, which is the sample standard deviation, whereas the `stddev_pop()` function gives the population standard deviation.
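The two results differ only by a fixed factor: for n rows, the population standard deviation equals the sample standard deviation scaled by sqrt((n−1)/n). A plain-Python check with the five Price values (outside Spark, using the built-in `statistics` module) confirms the numbers in the output above:

```python
import math
import statistics

prices = [250, 300, 250, 1200, 700]
n = len(prices)

sample_sd = statistics.stdev(prices)   # ~414.4273, like stddev_samp()
pop_sd = statistics.pstdev(prices)     # ~370.6751, like stddev_pop()

# population std dev = sample std dev * sqrt((n-1)/n)
print(math.isclose(pop_sd, sample_sd * math.sqrt((n - 1) / n)))  # True
```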

### Standard deviation for more than one column

You can get the standard deviation for more than one column as well. Inside the `select()` function, use a separate `stddev()` function for each column you want to compute the std dev for.

Let’s compute the std dev for the “Price” and the “Pages” columns.

```
from pyspark.sql.functions import stddev

# standard deviation of the Price and Pages columns
dataframe.select(stddev("Price"), stddev("Pages")).show()
```

Output:

```
+------------------+------------------+
|stddev_samp(Price)|stddev_samp(Pages)|
+------------------+------------------+
|  414.427315702042|100.06597823436296|
+------------------+------------------+
```

We get the desired output.

You can also use the `agg()` function to compute the std dev of multiple columns.

```
# standard deviation of the Price and Pages columns
dataframe.agg({'Price': 'stddev', 'Pages': 'stddev'}).show()
```

Output:

```
+------------------+----------------+
|     stddev(Pages)|   stddev(Price)|
+------------------+----------------+
|100.06597823436296|414.427315702042|
+------------------+----------------+
```

We get the same result as above.
