I have a DataFrame with a column of comma-separated strings, as below:
df
text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link
I want to replace every value that starts with info_concern (e.g. info_concern_blue or info_concern_red) with plain info_concern, dropping everything after info_concern up to the next comma.
I tried the following regex:
df['replaced_text'] = [re.sub(r'info_concern[^,]*. ?,', 'info_concern,', x) for x in df['text']]
But this is giving me incorrect results.
Desired output:
replaced_text
,info_concern,replaced_mod,replaced_rad
,info_concern,info_concern,replaced_unit
,replaced_link
Please suggest/advise.
CodePudding user response:
You can use
df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)
If you want to make sure the match starts right after a comma or at the start of the string, add the (?<![^,]) lookbehind at the start of the pattern:
df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)
Details:
(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.
The \1 replacement replaces the match with the Group 1 value.
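To see what the lookbehind changes, here is a small sketch using plain re.sub on a made-up string (the field my_info_concern_x is hypothetical and not from the question). It only illustrates the difference between the two patterns above:

```python
import re

plain = r'(info_concern)[^,]*'
anchored = r'(?<![^,])(info_concern)[^,]*'

# Hypothetical sample: the second field merely *contains* "info_concern".
sample = ",info_concern_blue,my_info_concern_x"

# Without the lookbehind, the match can start in the middle of a field.
plain_result = re.sub(plain, r'\1', sample)

# With the lookbehind, only fields that begin with info_concern are touched.
anchored_result = re.sub(anchored, r'\1', sample)

print(plain_result)     # ,info_concern,my_info_concern
print(anchored_result)  # ,info_concern,my_info_concern_x
```

Which one you want depends on whether info_concern can legitimately appear inside a longer field name in your data.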
CodePudding user response:
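The issue is that the pattern info_concern[^,]*. ?, first matches up to the next comma using [^,]*
Then the part . ?, matches one more character of any kind (which can also be a comma, due to the .), an optional space, and then a literal comma.
So if the . consumes the first comma and a second comma follows, it will overmatch and remove too much.
Because the final comma is required, an info_concern value at the very end of the string is not matched at all.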
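A quick way to check this behaviour is to run the pattern from the question on two made-up strings (both are hypothetical inputs chosen to trigger the problem, not data from the question):

```python
import re

original = r'info_concern[^,]*. ?,'

# Hypothetical input with an empty field right after the value:
# the dot consumes the first comma and the literal comma eats the second.
sample = "info_concern_a,,b"
overmatched = re.sub(original, 'info_concern,', sample)
print(overmatched)  # info_concern,b  (one comma was lost)

# Hypothetical input where the value is the last field:
# there is no trailing comma, so the pattern cannot match.
tail = ",x,info_concern_blue"
unchanged = re.sub(original, 'info_concern,', tail)
print(unchanged)    # ,x,info_concern_blue
```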
You could also assert info_concern to the left with a lookbehind, match any chars except a comma, and replace that match with an empty string.
If there has to be a comma to the right, you can assert it with a lookahead:
(?<=\binfo_concern)[^,]*(?=,)
The pattern matches:
(?<=\binfo_concern) - Positive lookbehind, assert info_concern (starting at a word boundary) to the left
[^,]* - Match 0+ times any char except ,
(?=,) - Positive lookahead, assert , directly to the right
If the comma is not mandatory, you can omit the lookahead
(?<=\binfo_concern)[^,]*
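The two variants only differ when the value is the last field. A minimal sketch with re.sub, using a made-up string that ends without a comma (the sample is an assumption, not from the question):

```python
import re

with_lookahead = r'(?<=\binfo_concern)[^,]*(?=,)'
without_lookahead = r'(?<=\binfo_concern)[^,]*'

# Hypothetical sample: the info_concern value is the last field.
sample = ",replaced_mod,info_concern_red"

# The lookahead requires a comma to the right, so nothing is removed here.
kept = re.sub(with_lookahead, '', sample)

# Without the lookahead, the suffix is removed up to the end of the string.
trimmed = re.sub(without_lookahead, '', sample)

print(kept)     # ,replaced_mod,info_concern_red
print(trimmed)  # ,replaced_mod,info_concern
```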
For example
import pandas as pd
texts = [
",info_concern_blue,replaced_mod,replaced_rad",
",info_concern,info_concern_red,replaced_unit",
",replaced_link"
]
df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)
print(df)
Output
text replaced_text
0 ,info_concern_blue,replaced_mod,replaced_rad ,info_concern,replaced_mod,replaced_rad
1 ,info_concern,info_concern_red,replaced_unit ,info_concern,info_concern,replaced_unit
2 ,replaced_link ,replaced_link