Home > database >  How does the pd.str.extractall() regex input differ from re.search?
How does the pd.str.extractall() regex input differ from re.search?

Time:08-13

I am attempting to parse a pandas series string into multiple different columns to extract useful information. An example of a string in the original column would look like:

'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)'

My goal is turn a series of these strings into a dataframe which looks something like this:

Original Option Type Underlying Asset Strike Expiration
PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP) PUT PROCTOR & GAMBLE $140 01/07/22
CALL NETFLIX INC $800 EXP 01/21/22 (EXP) CALL NETFLIX $800 01/21/22
CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP) CALL GAMESTOP $950 01/21/22

I have gotten 80% there by using the code below:

df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']

sr = pd.Series(df_sample)

parse = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )')
print(parse)

This returns everything but the expiration. However, every attempt I've made at capturing the expiration always returns an empty dataframe. What confuses me is that the regex pattern I use works with the re.search function but not when trying to apply it to pd.str.extractall(). Remaining code below:

parse2 = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )(?P<Exp>\d{2}\/\d{2}\/\d{2})') #Returns an empty dataframe

match = re.search(r'\d{2}\/\d{2}\/\d{2}', 'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)')
print(match) #Returns the date string just fine

Would be great if someone could help me fix this code, and even more if you could explain the differences in how to utilize these functions with regex patterns.

Thanks!

CodePudding user response:

In pandas, Series.str.extract will extract the first match to a df with columns named after the groups in the regex. Series.str.extractall is the same but if there are several matches, you get several rows in the the final dataframe.

As for the expiration, I'm not certain what the issue was on your side, but this worked for me.

import pandas as pd

df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
    'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
    'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']

sr = pd.Series(df_sample)

df = sr.str.extract(r'(?P<OptionType>\D{1,4})\s(?P<UnderlyingAsset>[^$] )\s(?P<Strike>\$\d )\sEXP (?P<Expiration>\d\d\/\d\d/\d\d)')

df["Expiration"] = pd.to_datetime(df["Expiration"])

gives

  OptionType   UnderlyingAsset Strike Expiration
0        PUT  PROCTER & GAMBLE   $140 2022-01-07
1       CALL       NETFLIX INC   $800 2022-01-21
2       CALL     GAMESTOP CORP   $950 2022-01-21
  • Related