Pyspark dataframe print schema

Print PySpark DataFrame Schema

In this tutorial, we will look at how to print the schema of a PySpark dataframe with the help of some examples.

How to get the schema of a PySpark dataframe?

You can use the printSchema() method of a PySpark dataframe to print its schema. It displays the column names along with their types and whether each column is nullable. The following is the syntax:

# display the dataframe schema
DataFrame.printSchema()

It displays the dataframe schema in a tree format (and can show nested columns, if present).

Examples

Let’s look at some examples of using the above method to display the schema of a PySpark dataframe.

# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as a list of lists (one inner list per row)
data = [[1, "PHP", "Sravan", 250],
        [2, "SQL", "Chandra", 300],
        [3, "Python", "Harsha", 250],
        [4, "R", "Rohith", 1200],
        [5, "Hadoop", "Manasa", 700],
        ]

# create the dataframe from the books data and a list of column names
dataframe = spark.createDataFrame(data, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe schema
dataframe.printSchema()

Output:

root
 |-- Book_Id: long (nullable = true)
 |-- Book_Name: string (nullable = true)
 |-- Author: string (nullable = true)
 |-- Price: long (nullable = true)

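Here, we create a dataframe with four columns containing information on some books. None of the columns in the dataframe are nested. Since we passed only column names and no explicit schema, PySpark inferred the column types from the data, which is why the Python integers show up as long. You can see that the schema of the dataframe shows the column names and their respective types in a tree format.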

Alternatively, you can use the .schema attribute of a PySpark dataframe to get its schema as a StructType object.

# display the dataframe schema
print(dataframe.schema)

Output:

StructType(List(StructField(Book_Id,LongType,true),StructField(Book_Name,StringType,true),StructField(Author,StringType,true),StructField(Price,LongType,true)))

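We get the dataframe schema as output, but as a single-line StructType representation rather than the tree format printed by the printSchema() method. The exact formatting of this representation can differ between Spark versions.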

Schema For Nested Columns in PySpark

Let’s look at another example. This time let’s create a dataframe having a nested column and see what its schema looks like.

# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# import the types needed to build the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# create the dataframe schema; "Author" is a nested (struct) column
schema = StructType([
    StructField("Book_Id", IntegerType()),
    StructField("Book_Name", StringType()),
    StructField("Author", StructType([
        StructField("First Name", StringType()),
        StructField("Last Name", StringType()),
    ])),
    StructField("Price", IntegerType()),
])
  
# books data as a list of records; the inner list holds the author's first and last name
data = [[1, 'PHP', ['Sravan', 'Kumar'], 250],
        [2, 'SQL', ['Chandra', 'Sethi'], 300],
        [3, 'Python', ['Harsha', 'Patel'], 250],
        [4, 'R', ['Rohith', 'Samrat'], 1200],
        [5, 'Hadoop', ['Manasa', 'Gopal'], 700]]

# create the dataframe from the data and the explicit schema
dataframe = spark.createDataFrame(data, schema)

# display the dataframe schema
dataframe.printSchema()

Output:

root
 |-- Book_Id: integer (nullable = true)
 |-- Book_Name: string (nullable = true)
 |-- Author: struct (nullable = true)
 |    |-- First Name: string (nullable = true)
 |    |-- Last Name: string (nullable = true)
 |-- Price: integer (nullable = true)

The schema shows the dataframe columns and their types. Also, note that the “Author” column has nested columns – “First Name” and “Last Name”.

Authors

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Gottumukkala Sravan Kumar