My data frame looks like this-
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660
1 AXISBANK26May2022 CE 690
2 AXISBANK26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300
My desired output-
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660 26May2022 660 CE
1 AXISBANK26May2022 CE 690 26May2022 690 CE
2 AXISBANK26May2022 PE 670 26May2022 670 PE
3 BANKNIFTY19May2022 PE 30200 19May2022 30200 PE
4 BANKNIFTY19May2022 PE 31200 19May2022 31200 PE
5 BANKNIFTY26May2022 PE 34300 26May2022 34300 PE
I tried like this-
df['Expiry]= df['CONTRACT'].str.extract(r'(\d{2}\D{3}\d{4})')
df['Strike_price']= df['CONTRACT'].str.extract(r'(\d{5})')
df['Option_type']= df['CONTRACT'].str.extract(r'(\D\D)')
Please Help to find the correct columns without Space. Thanks
CodePudding user response:
One option is with str.extract
:
pattern = r"[A-Z] (\d \D \d )\s ([A-Z] )\s (\d )"
extracts = df.CONTRACT.str.extract(pattern)
extracts = extracts.set_axis(['Expiry', 'Strike_price', 'Option_type'], axis = 1)
df.assign(**extracts)
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 AXISBANK26May2022 CE 690 26May2022 CE 690
2 AXISBANK26May2022 PE 670 26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200 19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300 26May2022 PE 34300
Another approach is with str.split
on a regex, but it is a longer approach, and prone to more errors I suspect :
extracts = (df
.CONTRACT
.str.split(r"(\d \D \d )|\s ", expand = True)
.dropna(how = 'all', axis = 1)
.loc[:, lambda df: df.ne('').any()]
.iloc[:, 1:])
extracts = extracts.set_axis(['Expiry', 'Strike_price', 'Option_type'], axis = 1)
df.assign(**extracts)
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 AXISBANK26May2022 CE 690 26May2022 CE 690
2 AXISBANK26May2022 PE 670 26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200 19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300 26May2022 PE 34300
CodePudding user response:
You can use
df[['Expiry','Option_type','Stike_price']] = df['CONTRACT'].str.extract(r'(\d{2}[^\W\d]{3}\d{4})\s ([A-Z] )\s (\d )$', expand=True)
See the regex demo. Details:
(\d{2}[^\W\d]{3}\d{4})
- Group 1: two digits, three letters/underscore, four digits\s
- one or more whitespaces([A-Z] )
- Group 2: one or more uppercase ASCII letters\s
- one or more whitespaces(\d )
- Group 3: one or more digits$
- end of string.
Or, you may utilize named capturing groups to extract the data into a separate dataframe with column names already defined and in the expected order, and then merge the two:
import pandas as pd
df = pd.DataFrame({'CONTRACT':['AXISBANK26May2022 CE 660','BANKNIFTY19May2022 PE 30200']})
df_extr = df['CONTRACT'].str.extract(r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s (?P<Stike_price>[A-Z] )\s (?P<Option_type>\d )$', expand=True)
df = df.merge(df_extr, left_index=True, right_index=True)
Output:
>>> df.merge(df_extr, left_index=True, right_index=True)
CONTRACT Expiry Stike_price Option_type
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200