In this tutorial, we will look at how to sort a PySpark dataframe on one or more columns with the help of some examples.
How to sort a PySpark dataframe?
You can use the PySpark sort() function to sort the rows of a dataframe in ascending or descending order. The following is the syntax:
df.sort(*cols, ascending=True)
Pass the column, or the list of columns, to sort the dataframe on as an argument. The dataframe is sorted in ascending order by default. To change the order, use the optional ascending parameter, which is True by default and accepts either a single boolean or a list of booleans (one per column).
Examples
Let’s look at some examples of using the sort() function to sort a PySpark dataframe.
First, let’s create a sample PySpark dataframe that we will use throughout this tutorial.
# import the pyspark module
import pyspark
# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
df = [[1, "PHP", "Sravan", 250],
      [2, "SQL", "Chandra", 300],
      [3, "Python", "Harsha", 250],
      [4, "R", "Rohith", 1200],
      [5, "Hadoop", "Manasa", 700],
      ]

# creating dataframe from books data
dataframe = spark.createDataFrame(df, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe
dataframe.show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      3|   Python| Harsha|  250|
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
+-------+---------+-------+-----+
We now have a dataframe containing 5 rows and 4 columns with information about different books.
Sort PySpark dataframe in ascending order
Let’s now sort the above dataframe in ascending order of the “Price” column. For this, we’ll use the PySpark dataframe sort() function.
# sort on Price in ascending order
dataframe.sort("Price").show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
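You can see that the resulting dataframe is sorted in ascending order of “Price”. Note that the two books priced at 250 may appear in either order, since sorting on “Price” alone does not fix the relative order of rows with the same price.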
Alternatively, you can combine the sort() function with the col() function to sort a dataframe in ascending or descending order.
# import col function
from pyspark.sql.functions import col

# sort on Price in ascending order
dataframe.sort(col("Price").asc()).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
We get the same result as above.
Sort PySpark dataframe in descending order
To sort the dataframe in descending order, pass ascending=False to the sort() function. Let’s sort the above dataframe on the “Price” column in descending order.
# sort on Price in descending order
dataframe.sort("Price", ascending=False).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+
The dataframe is sorted in descending order of “Price”.
Again, you can combine the sort() function with the col() function to sort the dataframe in descending order.
# import col function
from pyspark.sql.functions import col

# sort on Price in descending order
dataframe.sort(col("Price").desc()).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+
We get the same result as above.
Sort dataframe on multiple columns
Pass the columns to sort the dataframe on as a list to the sort() function. Let’s sort the above dataframe on the “Price” and the “Book_Id” columns.
# sort on Price and Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"]).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+
By default, the dataframe is sorted in ascending order taking the two columns into account (first, using the “Price” column and then using the “Book_Id” column).
You can also pass a corresponding list containing the sorting order (ascending or descending) for each column to the ascending parameter. Let’s sort the above dataframe on “Price” in descending order and then on “Book_Id” in ascending order.
# sort on Price in descending order and then on Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"], ascending=[False, True]).show()
Output:
+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+