TIL: Use a Regular Expression (RegEx) with Pandas


Mirza S. Khan


December 27, 2023

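For a project I am working on, I wanted to identify mentions of brain natriuretic peptide (BNP) or N-terminal pro b-type natriuretic peptide (NT-proBNP) in a column containing clinical trial outcomes. I reviewed a bunch of the entries provided, and the best “catch-all” regular expression was (?i)(BNP|natriuretic peptide), where the leading (?i) makes the whole pattern case-insensitive. Note that Python 3.11+ only accepts an inline flag like (?i) at the very start of the pattern; placing it mid-pattern, as in (BNP|(?i)natriuretic peptide), raises an error there and was deprecated in earlier versions.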
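As a quick sanity check outside of Pandas, the pattern can be tried with Python's built-in re module. This is only a minimal sketch: the sample strings are made up for illustration, and the (?i) flag is written at the start of the pattern so that it also runs on Python 3.11+.

```python
import re

# Inline (?i) flag at the very start makes the whole pattern case-insensitive
pattern_bnp = r'(?i)(BNP|natriuretic peptide)'

# Made-up outcome strings, for illustration only
samples = [
    'Change in NT-proBNP at 12 weeks',
    'B-type Natriuretic Peptide level',
    'KCCQ overall summary score',
]

# re.search returns a match object if the pattern occurs anywhere, else None
hits = [bool(re.search(pattern_bnp, s)) for s in samples]
print(hits)  # [True, True, False]
```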

I’ve slowly been migrating over to polars, but Pandas has a neat findall() method that accepts RegEx patterns, which I ended up using. Below is a simple example to highlight this functionality.

import pandas as pd

# RegEx pattern; the inline (?i) flag must lead the pattern on Python 3.11+
pattern_bnp = r'(?i)(BNP|natriuretic peptide)'

# Simple example df
data = pd.DataFrame({
    "outcomes": ['This has KCCQ', 'This has NT-proBNP', None, 'This has Seattle Angina Questionnaire', 'bnp']
})
data
                                outcomes
0                          This has KCCQ
1                     This has NT-proBNP
2                                   None
3  This has Seattle Angina Questionnaire
4                                    bnp

We can apply our RegEx pattern to identify which entries in the ‘outcomes’ column mention BNP/NT-proBNP. Here I’ve used the findall() and contains() (which uses re.search) methods, but other Pandas string methods, such as match(), extract(), and count(), also accept RegEx patterns.

# findall
data['outcomes'].str.findall(pattern_bnp)
0       []
1    [BNP]
2     None
3       []
4    [bnp]
Name: outcomes, dtype: object
# contains
data['outcomes'].str.contains(pattern_bnp, regex = True, na=False).astype(int)
0    0
1    1
2    0
3    0
4    1
Name: outcomes, dtype: int64

<string>:3: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.
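The warning above comes from the capturing parentheses in the pattern. One possible way around it, sketched here rather than taken from the original analysis, is a non-capturing group (?:...) combined with the flags argument that both str.contains() and str.findall() accept. The names pattern_nc, has_bnp, and matches are hypothetical and used only for this illustration.

```python
import re

import pandas as pd

# Same toy data as in the example above
data = pd.DataFrame({
    "outcomes": ['This has KCCQ', 'This has NT-proBNP', None,
                 'This has Seattle Angina Questionnaire', 'bnp']
})

# (?:...) groups the alternatives without creating a match group
pattern_nc = r'(?:BNP|natriuretic peptide)'

# Case-insensitivity is passed via flags instead of an inline (?i)
has_bnp = data['outcomes'].str.contains(
    pattern_nc, flags=re.IGNORECASE, regex=True, na=False
).astype(int)
print(has_bnp.tolist())  # [0, 1, 0, 0, 1]

# findall returns the full matched text when the pattern has no capture groups
matches = data['outcomes'].str.findall(pattern_nc, flags=re.IGNORECASE)
print(matches[1], matches[4])  # ['BNP'] ['bnp']
```

Because the pattern no longer has match groups, str.contains() should not emit the UserWarning. If the matched text itself is needed as a column, str.extract() with a capturing group is the method the warning points to.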