Sort Pyspark Dataframe on One or More Columns

In this tutorial, we will look at how to sort a Pyspark dataframe on one or more columns with the help of some examples.

How to sort a Pyspark dataframe?

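You can use the Pyspark sort() function to sort the rows of a Pyspark dataframe in ascending or descending order. The following is the syntax:

df.sort(*cols, ascending=True)

Pass one or more column names (or Column objects) to sort on, either as separate arguments or as a single list. The dataframe is sorted in ascending order by default. To change this, set the optional ascending parameter (True by default) to False, or to a list of booleans with one entry per column. Note that sort() returns a new dataframe and leaves the original dataframe unchanged.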

Examples

Let’s look at some examples of using the sort() function to sort a Pyspark dataframe.

First, let’s create a sample Pyspark dataframe that we will be using throughout this tutorial.

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as a list of lists
books_data = [
    [1, "PHP", "Sravan", 250],
    [2, "SQL", "Chandra", 300],
    [3, "Python", "Harsha", 250],
    [4, "R", "Rohith", 1200],
    [5, "Hadoop", "Manasa", 700],
]

# create the dataframe from the books data, naming the columns
dataframe = spark.createDataFrame(books_data, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe
dataframe.show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      3|   Python| Harsha|  250|
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
+-------+---------+-------+-----+

We now have a dataframe containing 5 rows and 4 columns with information about different books.

Sort Pyspark dataframe in ascending order

Let’s now sort the above dataframe in ascending order of the “Price” column. For this, we’ll use the Pyspark dataframe sort() function.

# sort on Price in ascending order
dataframe.sort("Price").show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

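You can see that the resulting dataframe is sorted in ascending order of “Price”. Note that the two books priced at 250 are tied, and sort() does not guarantee the relative order of tied rows. If you need a deterministic order, sort on an additional column as well.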

Alternatively, you can also combine the sort() function with the col() function to sort a dataframe in ascending or descending order.

# import col function
from pyspark.sql.functions import col

# sort on Price in ascending order
dataframe.sort(col("Price").asc()).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

We get the same result as above.

Sort Pyspark dataframe in descending order

To sort the dataframe in descending order, pass ascending=False to the sort() function. Let’s sort the above dataframe on the “Price” column in descending order.

# sort on Price in descending order
dataframe.sort("Price", ascending=False).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

The dataframe is sorted in descending order of “Price”.

Again, you can also combine the sort() function with the col() function to sort the dataframe in descending order.

# import col function (already imported above; repeated so this snippet runs on its own)
from pyspark.sql.functions import col

# sort on Price in descending order
dataframe.sort(col("Price").desc()).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

We get the same result as above.

Sort dataframe on multiple columns

To sort on multiple columns, pass the column names as a list to the sort() function. Let’s sort the above dataframe on the “Price” and the “Book_Id” columns.

# sort on Price and Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"]).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

By default, both columns are sorted in ascending order: rows are ordered by “Price” first, and rows with the same “Price” are then ordered by “Book_Id”. This is why the book with Book_Id 1 now comes before the book with Book_Id 3.

You can also pass a corresponding list containing the sorting order (ascending or descending) for each column to the ascending parameter. Let’s sort the above dataframe on “Price” in descending order and then on “Book_Id” in ascending order.

# sort on Price in descending order, then on Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"], ascending=[False, True]).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

Authors

  • Piyush

    Piyush is a data scientist passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Gottumukkala Sravan Kumar