How to use a variable (i) from a for loop in a regular expression

I am trying to find matches in a pandas text column according to my pattern: any word between text: and the next comma. Example:

column	text
text:xyzs,line:yzds,sentence:dhfjdh,	xyzs

try:
    df['text'] = df['column'].str.extract(r'text:(.+?),')
except AttributeError:
    df['text'] = np.nan

I want to use a for loop to dynamically change the starting pattern of the regex. For example, replace text with line, then with sentence.

for i in ['text','line','sentence']:
    df[i] = df['column'].str.extract(r'i:(.+?),')    # This is not working: trying to replace text: with i:

The output should be:

column	text	line	sentence
text:xyzs,line:yzds,sentence:dhfjdh,	xyzs	yzds	dhfjdh

CodePudding user response：

You can capture both the part before and the part after the ':' separator, then pivot:

out = (df['column']
 .str.extractall(r'([^,:]+):([^,:]+)')
 .droplevel(1)
 .pivot(columns=0, values=1)
 #.reindex(list_of_cols, axis=1)  # if needed reindex with a list of wanted terms
)

NB. If you want specific prefixes, you can incorporate them in the regex (e.g., r'(text|line):([^,:]+)'), reindex afterwards, or both.

output:

0  line sentence  text
0  yzds   dhfjdh  xyzs
1  efgh     ijkl  abcd

used input:

                                column
0  text:xyzs,line:yzds,sentence:dhfjdh
1    line:efgh,text:abcd,sentence:ijkl

You can also join the original dataframe:

df.join(out)

output:

                                column  line sentence  text
0  text:xyzs,line:yzds,sentence:dhfjdh  yzds   dhfjdh  xyzs
1    line:efgh,text:abcd,sentence:ijkl  efgh     ijkl  abcd
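The NB above about restricting the prefixes could be sketched roughly as follows. This is only an illustration: the `wanted` list and the way the pattern is assembled from it are my own assumptions, not part of the answer, and it reuses the two-row input shown above.

```python
import re

import pandas as pd

df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh',
                              'line:efgh,text:abcd,sentence:ijkl']})

# Hypothetical list of prefixes to keep (an assumption for this sketch).
wanted = ['text', 'line']

# Build an alternation such as \b(text|line):([^,:]+); re.escape guards
# against prefixes that contain regex metacharacters.
pattern = r'\b(' + '|'.join(map(re.escape, wanted)) + r'):([^,:]+)'

out = (df['column']
       .str.extractall(pattern)      # one row per match, columns 0 (key) and 1 (value)
       .droplevel(1)                 # drop the per-row match counter
       .pivot(columns=0, values=1)   # keys become columns
       .reindex(wanted, axis=1)      # enforce the wanted column order
)

print(out)
```

With this pattern the sentence entries are simply never matched, and the reindex step only fixes the column order.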

CodePudding user response：

Another solution:

df = pd.concat(
    [
        df,
        df.apply(
            lambda x: {
                (v := s.split(":"))[0]: v[1]
                for s in map(str.strip, x["column"].split(","))
                if s != ""
            },
            axis=1,
        ).apply(pd.Series),
    ],
    axis=1,
)

print(df)

Prints:

                                 column  text  line sentence
0  text:xyzs,line:yzds,sentence:dhfjdh,  xyzs  yzds   dhfjdh

CodePudding user response：

Yet another option, using f-strings:

import pandas as pd

df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh,',
                              'text:abc,sentence:def,line:xyz12345,',
                              'sentence:abcdef,line:4ta12,text:abc,']})
for i in ['text','line','sentence']:
    df[i] = df['column'].str.extract(fr'\b{i}:([^,]+)')

print(df)

Prints:

                                 column  text      line sentence
0  text:xyzs,line:yzds,sentence:dhfjdh,  xyzs      yzds   dhfjdh
1  text:abc,sentence:def,line:xyz12345,   abc  xyz12345      def
2  sentence:abcdef,line:4ta12,text:abc,   abc     4ta12   abcdef
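To see why the loop in the question found nothing while the f-string version works, here is a minimal sketch using only the standard `re` module on a single row. The helper name `extract_field` is hypothetical, and wrapping the key in `re.escape` is an extra precaution I am assuming is wanted, not something the answer does.

```python
import re

row = 'text:xyzs,line:yzds,sentence:dhfjdh,'

# In a plain raw string the loop variable is never substituted:
# the pattern looks for the literal letter "i" followed by a colon.
literal = re.search(r'i:(.+?),', row)

def extract_field(key, s):
    """Hypothetical helper: return the value following 'key:' or None."""
    # The f prefix interpolates the key; the r prefix keeps \b as a word boundary.
    m = re.search(fr'\b{re.escape(key)}:([^,]+)', s)
    return m.group(1) if m else None

fields = {key: extract_field(key, row) for key in ['text', 'line', 'sentence']}

print(literal)
print(fields)
```

One thing to watch for with f-string patterns: literal regex braces such as a `{2}` quantifier have to be doubled (`{{2}}`), since single braces are treated as interpolation fields.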