Pandas – Search for String in DataFrame Column

In this tutorial, we will look at how to search for a string (or a substring) in a pandas dataframe column with the help of some examples.

Search for string in a pandas column

You can use the pandas.Series.str.contains() function to search for the presence of a string in a pandas series (or a column of a dataframe). You can also pass a regex to check for more custom patterns in the series values. The following is the syntax:

# using pd.Series.str.contains() function with default parameters
df['Col'].str.contains("string_or_pattern", case=True, flags=0, na=None, regex=True)

It returns a boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

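The case parameter controls whether the string is matched in a case-sensitive manner (the default, case=True) or not.

The regex parameter controls how the pattern is interpreted: with regex=True (the default) it is treated as a regex pattern, and with regex=False it is treated as a literal string.

The flags parameter can be used to pass additional flags for the regex match through to the re module (for example, re.IGNORECASE). Flags apply only when the pattern is treated as a regex.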
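As a rough sketch of how these parameters behave together, the snippet below runs the same search three ways. The series `s` and its values are made up for illustration, and the na parameter (which appears in the syntax above) is used here as the fill value for missing entries.

```python
import re
import pandas as pd

# hypothetical sample data, including a missing value
s = pd.Series(['Apple pie', 'apple juice', 'Banana', None])

# case-sensitive literal search (case=True is the default)
literal = s.str.contains('apple', regex=False)

# case-insensitive search through the re module's flags (pattern treated as a regex)
ignore_case = s.str.contains('apple', flags=re.IGNORECASE, regex=True)

# na=False makes the missing entry count as "no match"
no_missing = s.str.contains('apple', case=False, na=False)

print(literal)
print(ignore_case)
print(no_missing)
```

Without na, how the missing entry is reported can depend on the column dtype and the pandas version, so passing na explicitly is the safer choice when the result will be used as a filter.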

Let’s look at some examples to see the above syntax in action.

Pass the string you want to check for as an argument.

import pandas as pd

# create a pandas series
players = pd.Series(['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])
# names with 'Singh'
print(players.str.contains('Singh', regex=False))

Output:

0    False
1     True
2    False
3     True
4    False
dtype: bool

Here, we created a pandas series containing the names of some of India’s top cricketers. We then found the names containing the word “Singh” using the str.contains() function. We also passed regex=False so that the value is treated as a literal string rather than a regex pattern. In this case, the default regex=True would give the same result because “Singh” contains no special regex characters.
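The choice does matter once the search text contains characters that are special in a regex. The following is a small illustration with made-up values (the series `amounts` is hypothetical): the dot in '5.0' matches any character when the pattern is treated as a regex, but only a literal dot when regex=False.

```python
import pandas as pd

# hypothetical values stored as strings
amounts = pd.Series(['5.0', '500', '5-0', '50'])

# as a regex, '.' matches any single character
as_regex = amounts.str.contains('5.0', regex=True)

# as a literal string, '.' matches only a dot
as_literal = amounts.str.contains('5.0', regex=False)

print(as_regex.tolist())    # [True, True, True, False]
print(as_literal.tolist())  # [True, False, False, False]
```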

Also note that we get the result as a pandas series of boolean values representing which of the values contained the given string. You can use this series to filter values in the original series.

For example, let’s print only the names containing the word “Singh”:

# str.contains() returns a boolean series, which can be used as a mask
# filter for names containing 'Singh'
print(players[players.str.contains('Singh')])

Output:

1            Yuvraj Singh
3    Mahendra Singh Dhoni
dtype: object
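Because the result is an ordinary boolean series, it can also be inverted or counted. The sketch below reuses the same player names; the variable name `has_singh` is introduced here only for illustration.

```python
import pandas as pd

players = pd.Series(['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# boolean mask of names containing 'Singh'
has_singh = players.str.contains('Singh', regex=False)

# ~ inverts the mask: names that do NOT contain 'Singh'
print(players[~has_singh].tolist())  # ['Rahul Dravid', 'Sachin Tendulkar', 'Virat Kohli']

# True counts as 1, so sum() gives the number of matches
print(has_singh.sum())  # 2
```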

Here we applied the .str.contains() function on a pandas series. Note that you can also apply it on individual columns of a pandas dataframe.

# create a dataframe
df = pd.DataFrame({
    'Name': ['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'],
    'IPL Team': ['RR', 'KXIP', 'MI', 'CSK', 'RCB']
})

# filter for names that have "Singh"
print(df[df['Name'].str.contains('Singh', regex=False)])

Output:

                   Name IPL Team
1          Yuvraj Singh     KXIP
3  Mahendra Singh Dhoni      CSK
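Real dataframe columns often contain missing values, and a mask that includes missing entries may not be usable for filtering (whether it raises an error depends on the dtype and pandas version). A minimal sketch, assuming a hypothetical dataframe `df_missing` with one missing name, is to pass na=False so that the mask is purely boolean:

```python
import pandas as pd

# hypothetical dataframe with a missing name
df_missing = pd.DataFrame({
    'Name': ['Yuvraj Singh', None, 'Virat Kohli'],
    'IPL Team': ['KXIP', 'MI', 'RCB']
})

# na=False treats the missing name as "does not contain 'Singh'"
mask = df_missing['Name'].str.contains('Singh', regex=False, na=False)

# the mask contains only True/False, so it is safe to filter with
print(df_missing[mask])
```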

By default, the pd.Series.str.contains() function’s string searches are case-sensitive.

# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])
# names with 'Singh' (case-sensitive by default)
print(players.str.contains('Singh', regex=False))

Output:

0    False
1    False
2    False
3     True
4    False
dtype: bool

We get False for “yuvraj singh” because it does not contain the word “Singh” in the same case.

You can, however, make the function search for strings irrespective of case by passing False to the case parameter.

# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])
# names with 'Singh' irrespective of case
print(players.str.contains('Singh', regex=False, case=False))

Output:

0    False
1     True
2    False
3     True
4    False
dtype: bool

You can also pass regex patterns to the above function for searching more complex values/patterns in the series.

# create a pandas series
balls = pd.Series(['wide', 'no ball', 'wicket', 'dot ball', 'runs'])
# check for wickets or dot balls
good_balls = balls.str.contains('wicket|dot ball', regex=True)
# display good balls
print(good_balls)

Output:

0    False
1    False
2     True
3     True
4    False
dtype: bool

Here we created a pandas series with values representing different outcomes when a bowler bowls a ball in cricket. Let’s say we want to find all the good balls, which can be defined as either a wicket or a dot ball. We used the regex pattern 'wicket|dot ball' to match either “wicket” or “dot ball”.

You can similarly write more complex regex patterns depending on your use-case to match values in a pandas series.
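As one possible illustration of a slightly richer pattern, the sketch below uses the regex anchors ^ (start of string) and $ (end of string) on the player names from earlier. The variable names are introduced here only for the example.

```python
import pandas as pd

players = pd.Series(['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# '$' anchors the match to the end of the string: names ending in 'Singh'
ends_with_singh = players.str.contains(r'Singh$', regex=True)

# '^' anchors to the start and [RS] matches either letter:
# names starting with 'R' or 'S'
starts_with_r_or_s = players.str.contains(r'^[RS]', regex=True)

print(players[ends_with_singh].tolist())     # ['Yuvraj Singh']
print(players[starts_with_r_or_s].tolist())  # ['Rahul Dravid', 'Sachin Tendulkar']
```

Note that “Mahendra Singh Dhoni” is excluded by the first pattern because “Singh” is not at the end of the name.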

For more on the pd.Series.str.contains() function, refer to its documentation.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a Python (version 3.8.3) kernel having pandas version 1.0.5.

Author

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.