How does the pd.str.extractall() regex input differ from re.search?-CodePudding

I am attempting to parse a pandas series string into multiple different columns to extract useful information. An example of a string in the original column would look like:

'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)'

My goal is turn a series of these strings into a dataframe which looks something like this:

Original	Option Type	Underlying Asset	Strike	Expiration
PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)	PUT	PROCTOR & GAMBLE	$140	01/07/22
CALL NETFLIX INC $800 EXP 01/21/22 (EXP)	CALL	NETFLIX	$800	01/21/22
CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)	CALL	GAMESTOP	$950	01/21/22

I have gotten 80% there by using the code below:

df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']

sr = pd.Series(df_sample)

parse = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )')
print(parse)

This returns everything but the expiration. However, every attempt I've made at capturing the expiration always returns an empty dataframe. What confuses me is that the regex pattern I use works with the re.search function but not when trying to apply it to pd.str.extractall(). Remaining code below:

parse2 = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )(?P<Exp>\d{2}\/\d{2}\/\d{2})') #Returns an empty dataframe

match = re.search(r'\d{2}\/\d{2}\/\d{2}', 'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)')
print(match) #Returns the date string just fine

Would be great if someone could help me fix this code, and even more if you could explain the differences in how to utilize these functions with regex patterns.

Thanks!

CodePudding user response：

In pandas, Series.str.extract will extract the first match to a df with columns named after the groups in the regex. Series.str.extractall is the same but if there are several matches, you get several rows in the the final dataframe.

As for the expiration, I'm not certain what the issue was on your side, but this worked for me.

import pandas as pd

df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
    'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
    'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']

sr = pd.Series(df_sample)

df = sr.str.extract(r'(?P<OptionType>\D{1,4})\s(?P<UnderlyingAsset>[^$] )\s(?P<Strike>\$\d )\sEXP (?P<Expiration>\d\d\/\d\d/\d\d)')

df["Expiration"] = pd.to_datetime(df["Expiration"])

gives

  OptionType   UnderlyingAsset Strike Expiration
0        PUT  PROCTER & GAMBLE   $140 2022-01-07
1       CALL       NETFLIX INC   $800 2022-01-21
2       CALL     GAMESTOP CORP   $950 2022-01-21