Sort Pyspark Dataframe on One or More Columns

In this tutorial, we will look at how to sort a Pyspark dataframe on one or more columns with the help of some examples.

How to sort a Pyspark dataframe?

You can use the Pyspark sort() function to sort data in a Pyspark dataframe in ascending or descending order. The following is the syntax –

df.sort(*cols, **kwargs)

Pass the column, several columns, or a list of columns to sort the dataframe on. The sort is ascending by default. To change the order, use the optional ascending keyword argument, which is True by default and accepts either a single boolean or a list of booleans (one per column).

Examples

Let’s look at some examples of using the sort() function to sort a Pyspark dataframe.

First, let’s create a sample Pyspark dataframe that we will be using throughout this tutorial.

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create (or reuse) a Spark session with the given app name
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as a list of lists
data = [[1, "PHP", "Sravan", 250],
        [2, "SQL", "Chandra", 300],
        [3, "Python", "Harsha", 250],
        [4, "R", "Rohith", 1200],
        [5, "Hadoop", "Manasa", 700],
        ]

# create a dataframe from the books data
dataframe = spark.createDataFrame(data, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe
dataframe.show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      3|   Python| Harsha|  250|
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
+-------+---------+-------+-----+

We now have a dataframe containing 5 rows and 4 columns with information about different books.

Sort Pyspark dataframe in ascending order

Let’s now sort the above dataframe in ascending order of the “Price” column. For this, we’ll use the Pyspark dataframe sort() function.

# sort on Price in ascending order
dataframe.sort("Price").show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

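You can see that the resulting dataframe is sorted in ascending order of “Price”. Note that the two books priced at 250 may appear in either order, since sort() does not guarantee the relative order of rows with equal values in the sort column.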

Alternatively, you can pass sort() a column expression built with the col() function and call its asc() or desc() method to set the sort order.

# import col function
from pyspark.sql.functions import col

# sort on Price in ascending order
dataframe.sort(col("Price").asc()).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      3|   Python| Harsha|  250|
|      1|      PHP| Sravan|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

We get the same result as above.

Sort Pyspark dataframe in descending order

To sort the dataframe in descending order, pass ascending=False to the sort() function. Let’s sort the above dataframe on the “Price” column in descending order.

# sort on Price in descending order
dataframe.sort("Price", ascending=False).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

The dataframe is sorted in descending order of “Price”.

Again, you can also combine the sort() function with the col() function to sort the dataframe in descending order.

# import col function
from pyspark.sql.functions import col

# sort on Price in descending order
dataframe.sort(col("Price").desc()).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

We get the same result as above.

Sort dataframe on multiple columns

Pass the columns to sort the dataframe on as a list to the sort() function. Let’s sort the above dataframe on the “Price” and the “Book_Id” columns.

# sort on Price and Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"]).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
|      2|      SQL|Chandra|  300|
|      5|   Hadoop| Manasa|  700|
|      4|        R| Rohith| 1200|
+-------+---------+-------+-----+

By default, the dataframe is sorted in ascending order taking the two columns into account (first, using the “Price” column and then using the “Book_Id” column).

You can also pass a corresponding list containing the sorting order (ascending or descending) for each column to the ascending parameter. Let’s sort the above dataframe on “Price” in descending order and then on “Book_Id” in ascending order.

# sort on Price in descending order, then on Book_Id in ascending order
dataframe.sort(["Price", "Book_Id"], ascending=[False, True]).show()

Output:

+-------+---------+-------+-----+
|Book_Id|Book_Name| Author|Price|
+-------+---------+-------+-----+
|      4|        R| Rohith| 1200|
|      5|   Hadoop| Manasa|  700|
|      2|      SQL|Chandra|  300|
|      1|      PHP| Sravan|  250|
|      3|   Python| Harsha|  250|
+-------+---------+-------+-----+

Authors

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Gottumukkala Sravan Kumar