How to match and replace everything after a specific word up to the next comma in a list of strings

I have a DataFrame with a list of strings, as below:

df

text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link

I want to replace every item that starts with info_concern (e.g. info_concern_blue or info_concern_red) with plain info_concern, i.e. drop everything after info_concern up to the next comma.

I tried the following regex:

df['replaced_text'] =  [re.sub(r'info_concern[^,]*. ?,', 'info_concern,',
           x) for x in df['text']]

But this is giving me incorrect results.

Desired output:

replaced_text

    ,info_concern,replaced_mod,replaced_rad
    ,info_concern,info_concern,replaced_unit
    ,replaced_link

Please suggest/advise.

CodePudding user response：

You can use

df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)

If you want to make sure the match starts right after a comma or start of string, add the (?<![^,]) lookbehind at the start of the pattern:

df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)

Details:

(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.

The \1 replacement replaces the match with Group 1 value.
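
To see what the lookbehind changes, here is a minimal sketch using plain re.sub so it runs without pandas; the same pattern strings can be passed to Series.str.replace(..., regex=True). The item my_info_concern_x is a made-up example that is not in the question's data.

```python
import re

# Hypothetical input: "my_info_concern_x" merely contains info_concern
# in the middle of an item; it does not come from the original question.
text = ",my_info_concern_x,info_concern_blue,replaced_mod"

# Without the lookbehind, info_concern is matched anywhere, even mid-item.
plain = re.sub(r'(info_concern)[^,]*', r'\1', text)

# With (?<![^,]), the match must start right after a comma or at the
# start of the string, so the first item is left alone.
anchored = re.sub(r'(?<![^,])(info_concern)[^,]*', r'\1', text)

print(plain)     # ,my_info_concern,info_concern,replaced_mod
print(anchored)  # ,my_info_concern_x,info_concern,replaced_mod
```

If items can never contain info_concern in the middle, both patterns give the same result and the shorter one is enough.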

CodePudding user response：

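The issue is that in the pattern info_concern[^,]*. ?, the [^,]* part matches up to, but not including, the first comma.

Then the . ?, part requires one more character of any kind (the . can also match a comma), an optional space, and then a comma.

So if the . lands on a comma and another comma follows right after it, the pattern overmatches and removes too much. It also cannot match an info_concern item at the very end of the string, where no comma follows.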
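
A small sketch of two inputs where the pattern from the question misbehaves. Both strings are invented for illustration (they are not in the question's data), and attempt is a hypothetical helper name.

```python
import re

# The pattern and replacement from the question, unchanged.
pattern = r'info_concern[^,]*. ?,'

def attempt(text):
    """Apply the question's substitution to a single string."""
    return re.sub(pattern, 'info_concern,', text)

# Hypothetical input 1: the item is last, so no comma follows it.
last_item = ",replaced_mod,info_concern_blue"

# Hypothetical input 2: an empty field (two consecutive commas) follows.
empty_field = ",info_concern_blue,,replaced_mod"

# No trailing comma, so the pattern cannot match and nothing is replaced.
print(attempt(last_item))    # ,replaced_mod,info_concern_blue

# The . consumes the first comma and the literal , consumes the second,
# so the empty field is swallowed.
print(attempt(empty_field))  # ,info_concern,replaced_mod
```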

You could also assert info_concern to the left with a lookbehind, match any characters other than a comma, and replace that match with an empty string.

If there has to be a comma to the right, you can assert it.

(?<=\binfo_concern)[^,]*(?=,)

The pattern matches:

(?<=\binfo_concern) Positive lookbehind, assert info_concern to the left
[^,]* Match 0 or more times any char except ,
(?=,) Positive lookahead, assert , directly to the right

If the comma is not mandatory, you can omit the lookahead

(?<=\binfo_concern)[^,]*

For example

import pandas as pd

texts = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link"
]

df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)

print(df)

Output

                                           text                             replaced_text
0  ,info_concern_blue,replaced_mod,replaced_rad   ,info_concern,replaced_mod,replaced_rad
1  ,info_concern,info_concern_red,replaced_unit  ,info_concern,info_concern,replaced_unit
2                                ,replaced_link                            ,replaced_link
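
The two lookbehind variants only differ when an info_concern item is the last one in the string. A minimal sketch with plain re.sub, assuming a made-up input that ends with such an item:

```python
import re

# Hypothetical input: the info_concern item is last, with no trailing comma.
text = ",replaced_mod,info_concern_red"

# With (?=,) a comma must follow, so the last item is left untouched.
with_lookahead = re.sub(r'(?<=\binfo_concern)[^,]*(?=,)', '', text)

# Without the lookahead, the suffix is removed at the end of the string too.
without_lookahead = re.sub(r'(?<=\binfo_concern)[^,]*', '', text)

print(with_lookahead)     # ,replaced_mod,info_concern_red
print(without_lookahead)  # ,replaced_mod,info_concern
```

Which one to use depends on whether a trailing item should be normalised as well.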