Pyspark - Add a New Column to a DataFrame

In this tutorial, we will look at how to add a new column to a Pyspark dataframe with the help of some examples.

How to add a new column to a Pyspark dataframe?

You can use the withColumn() function of a Pyspark dataframe to add a new column to it. The following is the syntax:

# add new column
DataFrame.withColumn(colName, col)

Here, colName is the name of the new column and col is a column expression. It returns a new Pyspark dataframe with the column added; the original dataframe is not modified. If a column named colName already exists, it is replaced.

Examples

Let’s look at some examples of adding new columns to an existing Pyspark dataframe. First, we will create a Pyspark dataframe that we will be using throughout this tutorial.

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# data of items sold 
data = [[1, "Soap", 20, 3],
        [1, "Shampoo", 50, 1],
        [2, "Toothpaste", 40, 2],
        [3, "Juice", 35, 4],
        [3, "Face Wash", 30, 1]]

# create a Pyspark dataframe using the above data
df = spark.createDataFrame(data, ["Customer Id", "Item", "Price", "Quantity"])

# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+
|Customer Id|      Item|Price|Quantity|
+-----------+----------+-----+--------+
|          1|      Soap|   20|       3|
|          1|   Shampoo|   50|       1|
|          2|Toothpaste|   40|       2|
|          3|     Juice|   35|       4|
|          3| Face Wash|   30|       1|
+-----------+----------+-----+--------+

We now have a dataframe containing information on items purchased by some customers at a supermarket. The dataframe has information on the customer id, item name, price, and the quantity purchased.

Add a column with a constant value

Let’s use the withColumn() function to add a column for the discount rate, which is 10% for all the items in this supermarket. To add a column with a constant value, wrap the value in the lit() function (available in pyspark.sql.functions) and pass it to withColumn().

from pyspark.sql.functions import lit

# add column for discount
df = df.withColumn("Discount Rate", lit(0.10))
# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+-------------+
|Customer Id|      Item|Price|Quantity|Discount Rate|
+-----------+----------+-----+--------+-------------+
|          1|      Soap|   20|       3|          0.1|
|          1|   Shampoo|   50|       1|          0.1|
|          2|Toothpaste|   40|       2|          0.1|
|          3|     Juice|   35|       4|          0.1|
|          3| Face Wash|   30|       1|          0.1|
+-----------+----------+-----+--------+-------------+

You can see that the dataframe now has an additional column, “Discount Rate”, with a constant value of 0.1 for all the records.

Add a column using another column from the dataframe in Pyspark

You can also use the withColumn() function to create a column from the values of other columns, for example, as the result of an arithmetic operation on one or more existing columns.

Let’s add a column for the total price, which is equal to the item price multiplied by the item quantity.

# add column for total
df = df.withColumn("Total", df["Price"]*df["Quantity"])
# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+-------------+-----+
|Customer Id|      Item|Price|Quantity|Discount Rate|Total|
+-----------+----------+-----+--------+-------------+-----+
|          1|      Soap|   20|       3|          0.1|   60|
|          1|   Shampoo|   50|       1|          0.1|   50|
|          2|Toothpaste|   40|       2|          0.1|   80|
|          3|     Juice|   35|       4|          0.1|  140|
|          3| Face Wash|   30|       1|          0.1|   30|
+-----------+----------+-----+--------+-------------+-----+

You can see that the resulting dataframe has an additional column, “Total”, containing the total value of the items purchased before the discount is applied.

Authors

Piyush Raj

Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

Gottumukkala Sravan Kumar