I am trying to extract matches from a pandas text column according to a pattern: any word between text: and the following comma. Example:
column | text |
---|---|
text:xyzs,line:yzds,sentence:dhfjdh, | xyzs |
try:
    df['text'] = df['column'].str.extract(r'text:(.*?),')
except AttributeError:
    df['text'] = np.nan
I want to use a for loop to dynamically change the prefix at the start of the regex, for example replacing text with line and then with sentence.
for i in ['text','line','sentence']:
    df[i] = df['column'].str.extract(r'i:(.*?),')  # This is not working: trying to replace text: with i:
Expected output:
column | text | line | sentence |
---|---|---|---|
text:xyzs,line:yzds,sentence:dhfjdh, | xyzs | yzds | dhfjdh |
CodePudding user response:
You can capture both the part before the separator and the part after it, then pivot:
out = (df['column']
       .str.extractall(r'([^,:]+):([^,:]+)')  # group 0: prefix, group 1: value
       .droplevel(1)                          # drop the per-row match counter
       .pivot(columns=0, values=1)            # one column per prefix
       #.reindex(list_of_cols, axis=1) # if needed reindex with a list of wanted terms
      )
NB. If you only want specific prefixes, you can incorporate them in the regex (e.g., r'(text|line):([^,:]+)') and/or reindex afterwards.
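As a rough sketch of that note, the alternation can be built from a list of wanted prefixes and the same list reused for reindex. The name wanted is only illustrative, and named groups (key, value) are used here in place of the numbered groups above so the pivot arguments read more clearly; this assumes the prefixes are plain words with no regex metacharacters.

```python
import pandas as pd

df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh',
                              'line:efgh,text:abcd,sentence:ijkl']})

wanted = ['text', 'line']  # hypothetical list of prefixes to keep

# Build r'(?P<key>text|line):(?P<value>[^,:]+)' from the list
pattern = fr"(?P<key>{'|'.join(wanted)}):(?P<value>[^,:]+)"

out = (df['column']
       .str.extractall(pattern)             # one row per key:value match
       .droplevel(1)                        # keep only the original row index
       .pivot(columns='key', values='value')
       .reindex(wanted, axis=1)             # enforce the column order of `wanted`
      )
print(out)
```

Prefixes that are not in the list (sentence here) are simply never matched, so they do not appear in the result.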
output:
0 line sentence text
0 yzds dhfjdh xyzs
1 efgh ijkl abcd
used input:
column
0 text:xyzs,line:yzds,sentence:dhfjdh
1 line:efgh,text:abcd,sentence:ijkl
You can also join the result to the original dataframe:
df.join(out)
output:
column line sentence text
0 text:xyzs,line:yzds,sentence:dhfjdh yzds dhfjdh xyzs
1 line:efgh,text:abcd,sentence:ijkl efgh ijkl abcd
CodePudding user response:
Another solution:
df = pd.concat(
    [
        df,
        df.apply(
            # build {prefix: value} for each non-empty "prefix:value" item
            lambda x: {
                (v := s.split(":"))[0]: v[1]
                for s in map(str.strip, x["column"].split(","))
                if s != ""
            },
            axis=1,
        ).apply(pd.Series),
    ],
    axis=1,
)
print(df)
Prints:
column text line sentence
0 text:xyzs,line:yzds,sentence:dhfjdh, xyzs yzds dhfjdh
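If the one-liner is hard to read, the same idea can be unpacked into a named helper. This is only a sketch: parse_pairs is a made-up name, and it splits on the first colon only, which differs slightly from the comprehension above when a value itself contains a colon.

```python
import pandas as pd

df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh,']})

def parse_pairs(cell):
    """Turn 'k1:v1,k2:v2,' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    pairs = {}
    for item in cell.split(','):
        item = item.strip()
        if not item:              # skip the empty piece after a trailing comma
            continue
        key, value = item.split(':', 1)   # split on the first colon only
        pairs[key] = value
    return pairs

# One dict per row becomes one row of new columns
parsed = pd.DataFrame(df['column'].map(parse_pairs).tolist(), index=df.index)
result = df.join(parsed)
print(result)
```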
CodePudding user response:
Yet another option, using f-strings:
import pandas as pd
df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh,',
                              'text:abc,sentence:def,line:xyz12345,',
                              'sentence:abcdef,line:4ta12,text:abc,']})
for i in ['text','line','sentence']:
    # the f-string substitutes the value of i into the pattern
    df[i] = df['column'].str.extract(fr'\b{i}:([^,]+)')
print(df)
Prints:
column text line sentence
0 text:xyzs,line:yzds,sentence:dhfjdh, xyzs yzds dhfjdh
1 text:abc,sentence:def,line:xyz12345, abc xyz12345 def
2 sentence:abcdef,line:4ta12,text:abc, abc 4ta12 abcdef
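One caveat with interpolating a variable into a pattern: if a prefix can contain regex metacharacters, it is probably safer to pass it through re.escape first. The sketch below assumes a made-up prefix 'a.b' to show the idea, and uses expand=False so that extract returns a Series instead of a one-column DataFrame.

```python
import re
import pandas as pd

df = pd.DataFrame({'column': ['text:xyzs,line:yzds,sentence:dhfjdh,',
                              'a.b:123,text:abc,']})

# 'a.b' is a hypothetical prefix containing a regex metacharacter (the dot)
for key in ['text', 'a.b']:
    pattern = fr'\b{re.escape(key)}:([^,]+)'   # the dot is matched literally
    df[key] = df['column'].str.extract(pattern, expand=False)
print(df)
```

Rows where a prefix is absent get NaN in that column.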