In this tutorial, we will look at how to sort a Pyspark dataframe on one or more columns with the help of some examples.
How to sort a Pyspark dataframe?
You can use the Pyspark sort() function to sort data in a Pyspark dataframe in ascending or descending order. The following is the syntax:
df.sort(*cols, ascending=True)
Pass the column or the list of columns to sort the dataframe on as an argument. It sorts in ascending order by default, but you can specify the order using the optional ascending parameter, which is True by default.
Examples
Let’s look at some examples of using the sort() function to sort a Pyspark dataframe.
First, let’s create a sample Pyspark dataframe that we will be using throughout this tutorial.
# import the pyspark module
import pyspark
# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from the SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as a list of lists
df = [[1, "PHP", "Sravan", 250],
      [2, "SQL", "Chandra", 300],
      [3, "Python", "Harsha", 250],
      [4, "R", "Rohith", 1200],
      [5, "Hadoop", "Manasa", 700]]

# create the dataframe from the books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe
dataframe.show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      3|   Python| Harsha|  250|
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
+-------+---------+-------+-----+
We now have a dataframe containing 5 rows and 4 columns with information about different books.
Sort Pyspark dataframe in ascending order
Let’s now sort the above dataframe in ascending order of the “Price” column. For this, we’ll use the Pyspark dataframe sort() function.
# sort on Price in ascending order
dataframe.sort("Price").show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
You can see that the resulting dataframe is sorted in ascending order of “Price”.
Alternatively, you can combine the sort() function with the col() function to sort a dataframe in ascending or descending order.
# import col function
from pyspark.sql.functions import col

# sort on Price in ascending order
dataframe.sort(col("Price").asc()).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
We get the same result as above.
Sort Pyspark dataframe in descending order
To sort the dataframe in descending order, pass ascending=False to the sort() function. Let’s sort the above dataframe on the “Price” column in descending order.
# sort on Price in descending order
dataframe.sort("Price", ascending=False).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+
The dataframe is sorted in descending order of “Price”.
Again, you can combine the sort() function with the col() function to sort the dataframe in descending order.
# import col function
from pyspark.sql.functions import col

# sort on Price in descending order
dataframe.sort(col("Price").desc()).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+
We get the same result as above.
Sort dataframe on multiple columns
To sort on multiple columns, pass them as a list to the sort() function. Let’s sort the above dataframe on the “Price” and the “Book_Id” columns.
# sort on Price and Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"]).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
By default, the dataframe is sorted in ascending order taking the two columns into account (first, using the “Price” column and then using the “Book_Id” column).
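The two-column ordering behaves like comparing (Price, Book_Id) pairs: rows are ordered by “Price”, and “Book_Id” only decides between rows whose prices are equal (here, the two books priced at 250). As a rough illustration, the same ordering can be reproduced in plain Python without Spark. The books list, the tuple key, and the variable names below are illustrative assumptions and not part of the Pyspark API.

```python
# Plain-Python sketch (not Pyspark): mimic sorting on Price, then Book_Id.
# Rows mirror the sample books data used in this tutorial:
# (Book_Id, Book_Name, Author, Price)
books = [
    (1, "PHP", "Sravan", 250),
    (2, "SQL", "Chandra", 300),
    (3, "Python", "Harsha", 250),
    (4, "R", "Rohith", 1200),
    (5, "Hadoop", "Manasa", 700),
]

# The sort key is the pair (Price, Book_Id). Tuples compare element by
# element, so Book_Id is only consulted when two prices are equal.
by_price_then_id = sorted(books, key=lambda row: (row[3], row[0]))

# Book_Id values in sorted order
sorted_ids = [row[0] for row in by_price_then_id]
print(sorted_ids)  # [1, 3, 2, 5, 4]
```

When sorting on “Price” alone, nothing in the sort key distinguishes the two rows priced at 250, so their relative order should not be relied on. Adding “Book_Id” as a second column makes the row order fully determined.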
You can also pass a corresponding list containing the sorting order (ascending or descending) for each column to the ascending parameter. Let’s sort the above dataframe on “Price” in descending order and then on “Book_Id” in ascending order.
# sort on Price in descending order and then on Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"], ascending=[False, True]).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+