How do I remove the first instance of a regex capture group from strings in a pandas Series?

I want to remove the first instance of a regex capture group from a series of strings in pandas, the inverse of pandas.Series.str.extract. This works fine with extract in the following code, where I extract the first word that begins with a double underscore:

import pandas as pd

regex = r"^(?:.*?)(__\S+)"

t = pd.Series(['no match', '__match at start', 'match __in_the middle', 'the end __matches', 'string __with two __captures'])
t.str.extract(regex)

           0
0        NaN
1    __match
2   __in_the
3  __matches
4     __with
dtype: object

But if I call pandas.Series.str.replace with an empty replacement string, it removes the text matched by the non-capturing group as well:

t.str.replace(regex, '', regex=True)

0        no match
1        at start
2          middle
3
4  two __captures
dtype: object

How do I get the following output?

0                no match
1                at start
2           match  middle
3                the end
4  string  two __captures
dtype: object

CodePudding user response：

You can use

t.str.replace(r'__\S+', '', regex=True)

Output:

0         no match
1         at start
2    match  middle
3         the end 
dtype: object

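Basically, all you need to do is drop the non-capturing group and keep only the pattern that was inside the capture group. __\S+ matches __ followed by one or more non-whitespace characters, and Series.str.replace removes the matched text. Note that without a limit on the number of replacements, every match in a string is removed, not only the first.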
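To make the difference between the two patterns concrete, here is a minimal sketch using Python's re module on one of the sample strings. The variable names are only for illustration, and it assumes Series.str.replace with regex=True treats the pattern the same way re.sub does.

```python
import re

s = "string __with two __captures"

# The original pattern also matches the lazy prefix, so the whole match
# (prefix plus word) is replaced, not only the capture group.
with_prefix = re.sub(r"^(?:.*?)(__\S+)", "", s)

# Matching only the double-underscore word leaves the prefix untouched,
# but without a count every occurrence is removed.
word_only = re.sub(r"__\S+", "", s)

print(repr(with_prefix))  # ' two __captures'
print(repr(word_only))    # 'string  two '
```

A replacement always covers the entire match, whether or not parts of the pattern are wrapped in groups, which is why the prefix disappeared in the question.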

CodePudding user response：

According to the pandas docs, you can limit the number of replacements made from the start of each string with the n parameter. If it works anything like the count argument of Python's re.sub, then you can use @Wiktor Stribiżew's code with one modification:

t.str.replace(r'__\S+', '', n=1, regex=True)

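If this is not how things work in pandas, then apply re.sub to each string yourself:

import re

# count=1 removes only the first match in each string
t = t.apply(lambda s: re.sub(r"__\S+", "", s, count=1))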
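For completeness, here is a self-contained sketch that runs the limited replacement against the Series from the question. It also shows an alternative that is not from the answers above: capture the lazy prefix and put it back with a backreference. The names first_removed and via_backref are made up for this example.

```python
import pandas as pd

t = pd.Series([
    "no match",
    "__match at start",
    "match __in_the middle",
    "the end __matches",
    "string __with two __captures",
])

# n=1 limits the replacement to the first match in each string.
first_removed = t.str.replace(r"__\S+", "", n=1, regex=True)

# Alternative: capture the lazy prefix and restore it via \1, so only the
# double-underscore word is dropped. The ^ anchor keeps it to one match.
via_backref = t.str.replace(r"^(.*?)__\S+", r"\1", regex=True)

print(first_removed.tolist())
print(via_backref.tolist())
```

Both approaches leave the whitespace around the removed word in place, so a trailing or doubled space remains, as in the desired output in the question.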