How to Get sub string of a column in onother column of pandas dataframe using regex-CodePudding

My data frame looks like this-

CONTRACT                                Expiry   Strike_price  Option_type

0     AXISBANK26May2022 CE 660
1     AXISBANK26May2022 CE 690
2     AXISBANK26May2022 PE 670
3     BANKNIFTY19May2022 PE 30200
4     BANKNIFTY19May2022 PE 31200
5     BANKNIFTY26May2022 PE 34300

My desired output-

CONTRACT                                Expiry   Strike_price  Option_type

0     AXISBANK26May2022 CE 660         26May2022   660            CE
1     AXISBANK26May2022 CE 690         26May2022   690            CE
2     AXISBANK26May2022 PE 670         26May2022   670            PE
3     BANKNIFTY19May2022 PE 30200      19May2022   30200          PE
4     BANKNIFTY19May2022 PE 31200      19May2022   31200          PE
5     BANKNIFTY26May2022 PE 34300      26May2022   34300          PE

I tried like this-

df['Expiry]= df['CONTRACT'].str.extract(r'(\d{2}\D{3}\d{4})')
df['Strike_price']= df['CONTRACT'].str.extract(r'(\d{5})')
df['Option_type']= df['CONTRACT'].str.extract(r'(\D\D)')

Please Help to find the correct columns without Space. Thanks

CodePudding user response：

One option is with str.extract:

pattern = r"[A-Z] (\d \D \d )\s ([A-Z] )\s (\d )"
extracts = df.CONTRACT.str.extract(pattern)
extracts = extracts.set_axis(['Expiry', 'Strike_price', 'Option_type'], axis = 1)

df.assign(**extracts)

                      CONTRACT     Expiry Strike_price Option_type
0     AXISBANK26May2022 CE 660  26May2022           CE         660
1     AXISBANK26May2022 CE 690  26May2022           CE         690
2     AXISBANK26May2022 PE 670  26May2022           PE         670
3  BANKNIFTY19May2022 PE 30200  19May2022           PE       30200
4  BANKNIFTY19May2022 PE 31200  19May2022           PE       31200
5  BANKNIFTY26May2022 PE 34300  26May2022           PE       34300

Another approach is with str.split on a regex, but it is a longer approach, and prone to more errors I suspect :

extracts = (df
            .CONTRACT
            .str.split(r"(\d \D \d )|\s ", expand = True)
            .dropna(how = 'all', axis = 1)
            .loc[:, lambda df: df.ne('').any()]
            .iloc[:, 1:])

extracts = extracts.set_axis(['Expiry', 'Strike_price', 'Option_type'], axis = 1)

df.assign(**extracts)

                      CONTRACT     Expiry Strike_price Option_type
0     AXISBANK26May2022 CE 660  26May2022           CE         660
1     AXISBANK26May2022 CE 690  26May2022           CE         690
2     AXISBANK26May2022 PE 670  26May2022           PE         670
3  BANKNIFTY19May2022 PE 30200  19May2022           PE       30200
4  BANKNIFTY19May2022 PE 31200  19May2022           PE       31200
5  BANKNIFTY26May2022 PE 34300  26May2022           PE       34300

CodePudding user response：

You can use

df[['Expiry','Option_type','Stike_price']] = df['CONTRACT'].str.extract(r'(\d{2}[^\W\d]{3}\d{4})\s ([A-Z] )\s (\d )$', expand=True)

See the regex demo. Details:

(\d{2}[^\W\d]{3}\d{4}) - Group 1: two digits, three letters/underscore, four digits
\s - one or more whitespaces
([A-Z] ) - Group 2: one or more uppercase ASCII letters
\s - one or more whitespaces
(\d ) - Group 3: one or more digits
$ - end of string.

Or, you may utilize named capturing groups to extract the data into a separate dataframe with column names already defined and in the expected order, and then merge the two:

import pandas as pd
df = pd.DataFrame({'CONTRACT':['AXISBANK26May2022 CE 660','BANKNIFTY19May2022 PE 30200']})
df_extr = df['CONTRACT'].str.extract(r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s (?P<Stike_price>[A-Z] )\s (?P<Option_type>\d )$', expand=True)
df = df.merge(df_extr, left_index=True, right_index=True)

Output:

>>> df.merge(df_extr, left_index=True, right_index=True)
                      CONTRACT     Expiry Stike_price Option_type
0     AXISBANK26May2022 CE 660  26May2022          CE         660
1  BANKNIFTY19May2022 PE 30200  19May2022          PE       30200