In this tutorial, we will look at how to add a new column to a PySpark dataframe with the help of some examples.
How to add a new column to a PySpark dataframe?

You can use the PySpark withColumn() function to add a new column to a PySpark dataframe. The following is the syntax:
# add new column
DataFrame.withColumn(colName, col)
Here, colName is the name of the new column and col is a column expression. It returns a PySpark dataframe with the new column added.
Examples
Let’s look at some examples of adding new columns to an existing PySpark dataframe. First, we will create a PySpark dataframe that we will use throughout this tutorial.
# import the pyspark module
import pyspark
# import the sparksession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# data of items sold
data = [[1, "Soap", 20, 3],
        [1, "Shampoo", 50, 1],
        [2, "Toothpaste", 40, 2],
        [3, "Juice", 35, 4],
        [3, "Face Wash", 30, 1]]

# create a Pyspark dataframe using the above data
df = spark.createDataFrame(data, ["Customer Id", "Item", "Price", "Quantity"])

# display
df.show()
Output:
+-----------+----------+-----+--------+
|Customer Id|      Item|Price|Quantity|
+-----------+----------+-----+--------+
|          1|      Soap|   20|       3|
|          1|   Shampoo|   50|       1|
|          2|Toothpaste|   40|       2|
|          3|     Juice|   35|       4|
|          3| Face Wash|   30|       1|
+-----------+----------+-----+--------+
We now have a dataframe containing information on items purchased by some customers at a supermarket. The dataframe has information on the customer id, item name, price, and the quantity purchased.
Add a column with a constant value
Let’s use the withColumn() function to add a column for the discount rate, which is 10% for all the items in this supermarket. To add a column with a constant value, use the lit() function (available in pyspark.sql.functions) along with the withColumn() function.
from pyspark.sql.functions import lit

# add column for discount
df = df.withColumn("Discount Rate", lit(0.10))

# display the dataframe
df.show()
Output:
+-----------+----------+-----+--------+-------------+
|Customer Id|      Item|Price|Quantity|Discount Rate|
+-----------+----------+-----+--------+-------------+
|          1|      Soap|   20|       3|          0.1|
|          1|   Shampoo|   50|       1|          0.1|
|          2|Toothpaste|   40|       2|          0.1|
|          3|     Juice|   35|       4|          0.1|
|          3| Face Wash|   30|       1|          0.1|
+-----------+----------+-----+--------+-------------+
You can see that the dataframe now has an additional column, “Discount Rate”, with a constant value of 0.1 for all the records.
Add a column using another column from the dataframe in PySpark
You can also use the withColumn() function to create a column using values from other columns, for example, a column resulting from an arithmetic operation on one or more existing columns.
Let’s add a column for the total price, which is equal to the item price multiplied by the item quantity.
# add column for total
df = df.withColumn("Total", df["Price"]*df["Quantity"])

# display the dataframe
df.show()
Output:
+-----------+----------+-----+--------+-------------+-----+
|Customer Id|      Item|Price|Quantity|Discount Rate|Total|
+-----------+----------+-----+--------+-------------+-----+
|          1|      Soap|   20|       3|          0.1|   60|
|          1|   Shampoo|   50|       1|          0.1|   50|
|          2|Toothpaste|   40|       2|          0.1|   80|
|          3|     Juice|   35|       4|          0.1|  140|
|          3| Face Wash|   30|       1|          0.1|   30|
+-----------+----------+-----+--------+-------------+-----+
You can see that the resulting dataframe has an additional column, “Total”, containing the total value of the items purchased before the discount.
You might also be interested in –
- Rename DataFrame Column Name in Pyspark
- Filter PySpark DataFrame with where()
- Display DataFrame in Pyspark with show()