In this tutorial, we will look at how to get the variance of a column in a Pyspark dataframe with the help of some examples.
How to get variance for a Pyspark dataframe column?
You can use the variance()
function from the pyspark.sql.functions
module to compute the variance of a Pyspark column. The following is the syntax –
variance("column_name")
Pass the column name as a parameter to the variance()
function.
You can similarly use the var_samp()
function to get the sample variance and the var_pop()
function to get the population variance. Both functions are available in the same pyspark.sql.functions
module.
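For reference, both follow the same call pattern as variance(). Here is a minimal sketch ("column_name" is just a placeholder) –

```python
from pyspark.sql.functions import var_samp, var_pop

var_samp("column_name")  # sample variance (divides by n - 1), same as variance()
var_pop("column_name")   # population variance (divides by n)
```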
Examples
Let’s look at some examples of computing variance for column(s) in a Pyspark dataframe. First, let’s create a sample Pyspark dataframe that we will be using throughout this tutorial.
```python
# import the pyspark module
import pyspark

# import the sparksession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "PHP", "Sravan", 250, 454],
      [2, "SQL", "Chandra", 300, 320],
      [3, "Python", "Harsha", 250, 500],
      [4, "R", "Rohith", 1200, 310],
      [5, "Hadoop", "Manasa", 700, 270]]

# creating dataframe from books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price', 'Pages'])

# display the dataframe
dataframe.show()
```
Output:
```
+-------+---------+-------+-----+-----+
|Book_Id|Book_Name| Author|Price|Pages|
+-------+---------+-------+-----+-----+
|      1|      PHP| Sravan|  250|  454|
|      2|      SQL|Chandra|  300|  320|
|      3|   Python| Harsha|  250|  500|
|      4|        R| Rohith| 1200|  310|
|      5|   Hadoop| Manasa|  700|  270|
+-------+---------+-------+-----+-----+
```
We have a dataframe containing information on books like their author names, prices, pages, etc.
Variance of a single column
Let’s compute the variance for the “Price” column in the dataframe. To do so, you can use the variance()
function in combination with the Pyspark select()
function.
```python
from pyspark.sql.functions import variance

# variance of the Price column
dataframe.select(variance("Price")).show()
```
Output:
```
+---------------+
|var_samp(Price)|
+---------------+
|       171750.0|
+---------------+
```
We get the variance for the “Price” column. Note that the variance()
function gives the sample variance.
Alternatively, you can use the Pyspark agg()
function to compute the variance for a column.
```python
# variance of the Price column
dataframe.agg({'Price': 'variance'}).show()
```
Output:
```
+---------------+
|variance(Price)|
+---------------+
|       171750.0|
+---------------+
```
We get the same result as above.
Let’s now use the var_samp()
and var_pop()
functions on the same column along with the variance()
function to compare their results.
```python
from pyspark.sql.functions import variance, var_samp, var_pop

# sample and population variance of the Price column
dataframe.select(variance("Price"), var_samp("Price"), var_pop("Price")).show()
```
Output:
```
+---------------+---------------+--------------+
|var_samp(Price)|var_samp(Price)|var_pop(Price)|
+---------------+---------------+--------------+
|       171750.0|       171750.0|      137400.0|
+---------------+---------------+--------------+
```
You can see that variance()
and var_samp()
give the same result, which is the sample variance, whereas the var_pop()
function gives the population variance.
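If you want to double-check these numbers outside of Pyspark, you can reproduce them with Python's built-in statistics module. The following is a quick sketch using the same "Price" values as the example dataframe –

```python
import statistics

# "Price" values from the example dataframe
prices = [250, 300, 250, 1200, 700]

# sample variance: squared deviations from the mean divided by (n - 1)
print(statistics.variance(prices))   # 171750.0, matches variance() / var_samp()

# population variance: squared deviations from the mean divided by n
print(statistics.pvariance(prices))  # 137400.0, matches var_pop()
```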
Variance for more than one column
You can get the variance for more than one column as well. Inside the select()
function, use a separate variance()
function for each column you want to compute the variance for.
Let’s compute the variance for the “Price” and the “Pages” columns.
```python
from pyspark.sql.functions import variance

# variance of the Price and Pages columns
dataframe.select(variance("Price"), variance("Pages")).show()
```
Output:
```
+---------------+---------------+
|var_samp(Price)|var_samp(Pages)|
+---------------+---------------+
|       171750.0|        10013.2|
+---------------+---------------+
```
We get the desired output.
You can also use the agg()
function to compute the variance of multiple columns.
```python
# variance of the Price and Pages columns
dataframe.agg({'Price': 'variance', 'Pages': 'variance'}).show()
```
Output:
```
+---------------+---------------+
|variance(Pages)|variance(Price)|
+---------------+---------------+
|        10013.2|       171750.0|
+---------------+---------------+
```
We get the same result as above.
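If your dataframe has many numeric columns, you don't have to list each one by hand. The following is a minimal sketch (not part of the original examples) that builds one variance() expression per numeric column from the dataframe's schema –

```python
from pyspark.sql.functions import variance

# pick out the numeric columns from the dataframe's schema
numeric_cols = [field.name for field in dataframe.schema.fields
                if field.dataType.typeName() in ("integer", "long", "float", "double")]

# one variance() expression per numeric column (here: Book_Id, Price, and Pages)
dataframe.select([variance(c).alias("var_" + c) for c in numeric_cols]).show()
```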
You might also be interested in –
- Pyspark – Standard Deviation of a Column
- Calculate Standard Deviation in Python
- Pandas – Get Standard Deviation of one or more Columns