In this tutorial, we will look at how to print the schema of a Pyspark dataframe with the help of some examples.
How to get the schema of a Pyspark dataframe?

You can use the printSchema() function in Pyspark to print the schema of a dataframe. It displays the column names along with their types. The following is the syntax –
# display dataframe schema
DataFrame.printSchema()
It displays the dataframe schema in a tree format, including any nested columns, if present.
Examples
Let’s look at some examples of using the above function to display the schema of a Pyspark dataframe.
# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# books data as list of lists
data = [[1, "PHP", "Sravan", 250],
        [2, "SQL", "Chandra", 300],
        [3, "Python", "Harsha", 250],
        [4, "R", "Rohith", 1200],
        [5, "Hadoop", "Manasa", 700]]

# create dataframe from the books data
dataframe = spark.createDataFrame(data, ['Book_Id', 'Book_Name', 'Author', 'Price'])

# display the dataframe schema
dataframe.printSchema()
Output:
root
 |-- Book_Id: long (nullable = true)
 |-- Book_Name: string (nullable = true)
 |-- Author: string (nullable = true)
 |-- Price: long (nullable = true)
Here, we create a dataframe with four columns containing information on some books. None of the columns in the dataframe are nested. You can see that the schema of the dataframe shows the column names and their respective types in a tree format.
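By the way, if you want the column names and types as Python objects rather than just printed output, Pyspark dataframes also have a dtypes attribute that returns them as a list of (name, type) tuples. A quick sketch, reusing the dataframe created above (the output shown is what we’d expect for these columns) –

# get the column names and types as a list of (name, type) tuples
print(dataframe.dtypes)

Output:

[('Book_Id', 'bigint'), ('Book_Name', 'string'), ('Author', 'string'), ('Price', 'bigint')]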
Alternatively, you can also use the .schema attribute of a Pyspark dataframe to get its schema.
# display the dataframe schema
dataframe.schema
Output:
StructType(List(StructField(Book_Id,LongType,true),StructField(Book_Name,StringType,true),StructField(Author,StringType,true),StructField(Price,LongType,true)))
We get the dataframe schema as output, but it is not in the tree format that we get with the printSchema() method.
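Since .schema returns a StructType object, you can also work with it programmatically. For example, StructType has fieldNames() and json() methods, so a small sketch (reusing the dataframe from above) could look like this –

# get just the column names from the schema
print(dataframe.schema.fieldNames())

# serialize the schema to a JSON string, e.g. to save and reuse it later
print(dataframe.schema.json())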
Schema For Nested Columns in Pyspark
Let’s look at another example. This time let’s create a dataframe having a nested column and see what its schema looks like.
# import the pyspark module
import pyspark

# import the SparkSession class from pyspark.sql
from pyspark.sql import SparkSession

# import types for building the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create a SparkSession
spark = SparkSession.builder.appName('datascience_parichay').getOrCreate()

# create the dataframe schema with a nested "Author" column
schema = StructType([
    StructField("Book_Id", IntegerType()),
    StructField("Book_Name", StringType()),
    StructField("Author", StructType([
        StructField("First Name", StringType()),
        StructField("Last Name", StringType())])),
    StructField("Price", IntegerType())
])

# books data as a list of records
data = [[1, 'PHP', ['Sravan', 'Kumar'], 250],
        [2, 'SQL', ['Chandra', 'Sethi'], 300],
        [3, 'Python', ['Harsha', 'Patel'], 250],
        [4, 'R', ['Rohith', 'Samrat'], 1200],
        [5, 'Hadoop', ['Manasa', 'Gopal'], 700]]

# create dataframe from the books data using the schema
dataframe = spark.createDataFrame(data, schema)

# display the dataframe schema
dataframe.printSchema()
Output:
root
 |-- Book_Id: integer (nullable = true)
 |-- Book_Name: string (nullable = true)
 |-- Author: struct (nullable = true)
 |    |-- First Name: string (nullable = true)
 |    |-- Last Name: string (nullable = true)
 |-- Price: integer (nullable = true)
The schema shows the dataframe columns and their types. Also, note that the “Author” column has nested columns – “First Name” and “Last Name”.
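Once a column is nested, you can select its sub-fields individually with the getField() method on a column. A minimal sketch using the nested dataframe from the example above (getField() is used here because the field names contain spaces, which makes dot-notation strings awkward) –

# import col to reference columns by name
from pyspark.sql.functions import col

# select the sub-fields of the nested "Author" column
dataframe.select(
    col("Author").getField("First Name").alias("First Name"),
    col("Author").getField("Last Name").alias("Last Name")
).show()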
You might also be interested in –
- Get DataFrame Records with Pyspark collect()
- Display DataFrame in Pyspark with show()
- Rename DataFrame Column Name in Pyspark