In this tutorial, we will look at how to search for a string (or a substring) in a pandas dataframe column with the help of some examples.
How to check if a pandas series contains a string?
You can use the pandas.Series.str.contains() function to search for the presence of a string in a pandas series (or column of a dataframe). You can also pass a regex to check for more custom patterns in the series values. The following is the syntax:
# using pd.Series.str.contains() function with default parameters
df['Col'].str.contains("string_or_pattern", case=True, flags=0, na=None, regex=True)
It returns a boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
The case parameter controls whether the match is case-sensitive (True, the default) or not (False).
The regex parameter controls whether the pattern is treated as a regular expression (True, the default) or as a literal string (False).
The flags parameter can be used to pass additional flags for the regex match through to the re module (for example, re.IGNORECASE).
The na parameter sets the value to use in the result for missing entries.
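As a minimal sketch of the flags parameter described above, the snippet below passes re.IGNORECASE to a regex search. The teams series and its values are made up for illustration and are not part of the original examples.

```python
import re
import pandas as pd

# hypothetical sample data, used only for this illustration
teams = pd.Series(["Mumbai Indians", "chennai super kings", "Royal Challengers"])

# flags are forwarded to the re module; re.IGNORECASE makes the
# regex match ignore case even though case is left at its default
mask = teams.str.contains("KINGS", flags=re.IGNORECASE, regex=True)
print(mask.tolist())
```

For a plain case-insensitive search, passing case=False is usually the simpler choice; flags is mainly useful when you need other re options as well.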
Let’s look at some examples to see the above syntax in action.
Search for string in pandas column or series
Pass the string you want to check for as an argument.
import pandas as pd

# create a pandas series
players = pd.Series(['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh'
print(players.str.contains('Singh', regex=False))
Output:
0    False
1     True
2    False
3     True
4    False
dtype: bool
Here, we created a pandas series containing the names of some of India’s top cricketers. We then found the names containing the word “Singh” using the str.contains() function. We also passed regex=False so that the value is treated as a literal string rather than as a regex pattern. In this case, you could also go with the default regex=True, as it would not make any difference because “Singh” contains no special regex characters.
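The choice between regex=False and regex=True does matter once the search string contains characters that have a special meaning in a regex. The following sketch, using a made-up overs series, searches for "4.5" both ways; in regex mode the dot matches any single character.

```python
import pandas as pd

# hypothetical sample data, used only for this illustration
overs = pd.Series(["4.5 overs", "415 runs", "45 balls"])

# literal search: "." only matches an actual dot
literal = overs.str.contains("4.5", regex=False)

# regex search: "." matches any character, so "415" also matches
as_regex = overs.str.contains("4.5", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

If you need regex features and a literal special character in the same pattern, escaping that character (for example with re.escape from the standard library) is one option.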
Also note that we get the result as a pandas series of boolean values representing which of the values contained the given string. You can use this series to filter values in the original series.
For example, let’s print only the names containing the word “Singh”:
# display the type
type(players.str.contains('Singh'))

# filter for names containing 'Singh'
print(players[players.str.contains('Singh')])
Output:
1            Yuvraj Singh
3    Mahendra Singh Dhoni
dtype: object
Here we applied the .str.contains() function on a pandas series. Note that you can also apply it on individual columns of a pandas dataframe.
# create a dataframe
df = pd.DataFrame({
    'Name': ['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'],
    'IPL Team': ['RR', 'KXIP', 'MI', 'CSK', 'RCB']
})

# filter for names that have "Singh"
print(df[df['Name'].str.contains('Singh', regex=False)])
Output:
                   Name IPL Team
1          Yuvraj Singh     KXIP
3  Mahendra Singh Dhoni      CSK
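Real dataframe columns often contain missing values, which is where the na parameter from the syntax above comes in. The sketch below assumes a small made-up dataframe with one missing name and passes na=False so that missing entries count as "no match" and the result can be used directly as a filter. How missing values appear without na can depend on the pandas version and the column dtype.

```python
import pandas as pd

# hypothetical dataframe with a missing name, used only for this illustration
df_missing = pd.DataFrame({
    'Name': ['Yuvraj Singh', None, 'Virat Kohli'],
    'IPL Team': ['KXIP', 'MI', 'RCB']
})

# na=False treats missing names as not containing the string
mask = df_missing['Name'].str.contains('Singh', regex=False, na=False)

print(df_missing[mask])
```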
Search for string irrespective of case
By default, the pd.Series.str.contains() function’s string searches are case-sensitive.
# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh' (case-sensitive by default)
print(players.str.contains('Singh', regex=False))
Output:
0    False
1    False
2    False
3     True
4    False
dtype: bool
We get False for “yuvraj singh” because it does not contain the word “Singh” in the same case.
You can, however, make the function search for strings irrespective of case by passing False to the case parameter.
# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh' irrespective of case
print(players.str.contains('Singh', regex=False, case=False))
Output:
0    False
1     True
2    False
3     True
4    False
dtype: bool
Search for a matching regex pattern in column
You can also pass regex patterns to the above function for searching more complex values/patterns in the series.
# create a pandas series
balls = pd.Series(['wide', 'no ball', 'wicket', 'dot ball', 'runs'])

# check for wickets or dot balls
good_balls = balls.str.contains('wicket|dot ball', regex=True)

# display good balls
print(good_balls)
Output:
0    False
1    False
2     True
3     True
4    False
dtype: bool
Here we created a pandas series with values representing different outcomes when a bowler bowls a ball in cricket. Let’s say we want to find all the good balls, which can be defined as either a wicket or a dot ball. We used the regex pattern 'wicket|dot ball' to match either “wicket” or “dot ball”.
You can similarly write more complex regex patterns depending on your use-case to match values in a pandas series.
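As one possible illustration of a slightly more complex pattern, the sketch below reuses the same outcomes with regex anchors: ^ ties the match to the start of the value and $ to the end. The variable names are chosen here for illustration only.

```python
import pandas as pd

balls = pd.Series(['wide', 'no ball', 'wicket', 'dot ball', 'runs'])

# values that start with the letter "w"
starts_with_w = balls.str.contains(r'^w', regex=True)

# values that end with the word "ball"
ends_with_ball = balls.str.contains(r'ball$', regex=True)

print(balls[starts_with_w].tolist())
print(balls[ends_with_ball].tolist())
```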
For more on the pd.Series.str.contains() function, refer to its documentation.
With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.