Pyspark – Add a New Column to a DataFrame

In this tutorial, we will look at how to add a new column to a Pyspark dataframe with the help of some examples.

How to add a new column to a Pyspark dataframe?

You can use the Pyspark withColumn() function to add a new column to a Pyspark dataframe. The syntax is as follows:

# add new column
DataFrame.withColumn(colName, col)

Here, colName is the name of the new column and col is a column expression. It returns a new Pyspark dataframe with the column added; the original dataframe is not modified.

Examples

Let’s look at some examples of adding new columns to an existing Pyspark dataframe. First, we will create a Pyspark dataframe that we will be using throughout this tutorial.

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# data of items sold 
data = [[1, "Soap", 20, 3],
        [1, "Shampoo", 50, 1],
        [2, "Toothpaste", 40, 2],
        [3, "Juice", 35, 4],
        [3, "Face Wash", 30, 1]]

# create a Pyspark dataframe using the above data
df = spark.createDataFrame(data, ["Customer Id", "Item", "Price", "Quantity"])

# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+
|Customer Id|      Item|Price|Quantity|
+-----------+----------+-----+--------+
|          1|      Soap|   20|       3|
|          1|   Shampoo|   50|       1|
|          2|Toothpaste|   40|       2|
|          3|     Juice|   35|       4|
|          3| Face Wash|   30|       1|
+-----------+----------+-----+--------+

We now have a dataframe containing information on items purchased by some customers at a supermarket. The dataframe has information on the customer id, item name, price, and the quantity purchased.

Add a column with a constant value

Let’s use the withColumn() function to add a column for the discount rate, which is 10% for all items in this supermarket. To add a column with a constant value, use the lit() function (available in pyspark.sql.functions) along with the withColumn() function.

from pyspark.sql.functions import lit

# add column for discount
df = df.withColumn("Discount Rate", lit(0.10))
# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+-------------+
|Customer Id|      Item|Price|Quantity|Discount Rate|
+-----------+----------+-----+--------+-------------+
|          1|      Soap|   20|       3|          0.1|
|          1|   Shampoo|   50|       1|          0.1|
|          2|Toothpaste|   40|       2|          0.1|
|          3|     Juice|   35|       4|          0.1|
|          3| Face Wash|   30|       1|          0.1|
+-----------+----------+-----+--------+-------------+

You can see that the dataframe now has an additional column, “Discount Rate”, which has a constant value of 0.1 for all the records.

Add a column using another column from the dataframe in Pyspark

You can also use the withColumn() function to create a column from the values of other columns, for example, as the result of an arithmetic operation on one or more existing columns.

Let’s add a column for the total price, which is equal to the item price multiplied by the item quantity.

# add column for total
df = df.withColumn("Total", df["Price"]*df["Quantity"])
# display the dataframe
df.show()

Output:

+-----------+----------+-----+--------+-------------+-----+
|Customer Id|      Item|Price|Quantity|Discount Rate|Total|
+-----------+----------+-----+--------+-------------+-----+
|          1|      Soap|   20|       3|          0.1|   60|
|          1|   Shampoo|   50|       1|          0.1|   50|
|          2|Toothpaste|   40|       2|          0.1|   80|
|          3|     Juice|   35|       4|          0.1|  140|
|          3| Face Wash|   30|       1|          0.1|   30|
+-----------+----------+-----+--------+-------------+-----+

You can see that the resulting dataframe has an additional column, “Total”, containing the total value of each item purchased before the discount is applied.

Authors

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Gottumukkala Sravan Kumar