I am attempting to parse a pandas series string into multiple different columns to extract useful information. An example of a string in the original column would look like:
'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)'
My goal is turn a series of these strings into a dataframe which looks something like this:
Original | Option Type | Underlying Asset | Strike | Expiration |
---|---|---|---|---|
PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP) | PUT | PROCTOR & GAMBLE | $140 | 01/07/22 |
CALL NETFLIX INC $800 EXP 01/21/22 (EXP) | CALL | NETFLIX | $800 | 01/21/22 |
CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP) | CALL | GAMESTOP | $950 | 01/21/22 |
I have gotten 80% there by using the code below:
df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']
sr = pd.Series(df_sample)
parse = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )')
print(parse)
This returns everything but the expiration. However, every attempt I've made at capturing the expiration always returns an empty dataframe. What confuses me is that the regex pattern I use works with the re.search function but not when trying to apply it to pd.str.extractall(). Remaining code below:
parse2 = sr.str.extractall('(?P<OptionType>\D{1,4})(?P<UnderlyingAsset>[^$] )(?P<Strike>[$]\d )(?P<Exp>\d{2}\/\d{2}\/\d{2})') #Returns an empty dataframe
match = re.search(r'\d{2}\/\d{2}\/\d{2}', 'PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)')
print(match) #Returns the date string just fine
Would be great if someone could help me fix this code, and even more if you could explain the differences in how to utilize these functions with regex patterns.
Thanks!
CodePudding user response:
In pandas, Series.str.extract
will extract the first match to a df with columns named after the groups in the regex. Series.str.extractall
is the same but if there are several matches, you get several rows in the the final dataframe.
As for the expiration, I'm not certain what the issue was on your side, but this worked for me.
import pandas as pd
df_sample = ['PUT PROCTER & GAMBLE $140 EXP 01/07/22 (EXP)',
'CALL NETFLIX INC $800 EXP 01/21/22 (EXP)',
'CALL GAMESTOP CORP $950 EXP 01/21/22 (EXP)']
sr = pd.Series(df_sample)
df = sr.str.extract(r'(?P<OptionType>\D{1,4})\s(?P<UnderlyingAsset>[^$] )\s(?P<Strike>\$\d )\sEXP (?P<Expiration>\d\d\/\d\d/\d\d)')
df["Expiration"] = pd.to_datetime(df["Expiration"])
gives
OptionType UnderlyingAsset Strike Expiration
0 PUT PROCTER & GAMBLE $140 2022-01-07
1 CALL NETFLIX INC $800 2022-01-21
2 CALL GAMESTOP CORP $950 2022-01-21