In this tutorial, we will look at how to add a new column to a PySpark dataframe with the help of some examples.
How to add a new column to a PySpark dataframe?
You can use the PySpark withColumn() function to add a new column to a PySpark dataframe. The following is the syntax:
# add new column
DataFrame.withColumn(colName, col)
Here, colName is the name of the new column and col is a column expression. It returns a PySpark dataframe with the new column added.
Examples
Let’s look at some examples of adding new columns to an existing PySpark dataframe. First, we will create a PySpark dataframe that we will use throughout this tutorial.
# import the pyspark module
import pyspark

# import the sparksession class from pyspark.sql
from pyspark.sql import SparkSession

# create an app from SparkSession class
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# data of items sold
data = [[1, "Soap", 20, 3],
        [1, "Shampoo", 50, 1],
        [2, "Toothpaste", 40, 2],
        [3, "Juice", 35, 4],
        [3, "Face Wash", 30, 1]]

# create a Pyspark dataframe using the above data
df = spark.createDataFrame(data, ["Customer Id", "Item", "Price", "Quantity"])

# display
df.show()
Output:
+-----------+----------+-----+--------+
|Customer Id|      Item|Price|Quantity|
+-----------+----------+-----+--------+
|          1|      Soap|   20|       3|
|          1|   Shampoo|   50|       1|
|          2|Toothpaste|   40|       2|
|          3|     Juice|   35|       4|
|          3| Face Wash|   30|       1|
+-----------+----------+-----+--------+
We now have a dataframe containing information on items purchased by some customers at a supermarket. The dataframe has information on the customer id, item name, price, and the quantity purchased.
Add a column with a constant value
Let’s use the withColumn() function to add a column for the discount rate, which is 10% for all the items in this supermarket. To add a column with a constant value, use the lit() function (available in pyspark.sql.functions) along with the withColumn() function.
from pyspark.sql.functions import lit

# add column for discount
df = df.withColumn("Discount Rate", lit(0.10))

# display the dataframe
df.show()
Output:
+-----------+----------+-----+--------+-------------+
|Customer Id|      Item|Price|Quantity|Discount Rate|
+-----------+----------+-----+--------+-------------+
|          1|      Soap|   20|       3|          0.1|
|          1|   Shampoo|   50|       1|          0.1|
|          2|Toothpaste|   40|       2|          0.1|
|          3|     Juice|   35|       4|          0.1|
|          3| Face Wash|   30|       1|          0.1|
+-----------+----------+-----+--------+-------------+
You can see that the dataframe now has an additional column, “Discount Rate”, with a constant value of 0.1 for all the records.
Add a column using another column from the dataframe in PySpark
You can also use the withColumn() function to create a column using values from another column, for example, a column that results from an arithmetic operation on one or more existing columns.
Let’s add a column for the total price, which is equal to the item price multiplied by the item quantity.
# add column for total
df = df.withColumn("Total", df["Price"]*df["Quantity"])

# display the dataframe
df.show()
Output:
+-----------+----------+-----+--------+-------------+-----+
|Customer Id|      Item|Price|Quantity|Discount Rate|Total|
+-----------+----------+-----+--------+-------------+-----+
|          1|      Soap|   20|       3|          0.1|   60|
|          1|   Shampoo|   50|       1|          0.1|   50|
|          2|Toothpaste|   40|       2|          0.1|   80|
|          3|     Juice|   35|       4|          0.1|  140|
|          3| Face Wash|   30|       1|          0.1|   30|
+-----------+----------+-----+--------+-------------+-----+
You can see that the resulting dataframe has an additional column, “Total”, containing the total value of the items purchased before the discount is applied.