How do I remove the first instance of a regex capture group from strings in a pandas Series?

Time:05-03

I want to remove the first instance of a regex capture group from a series of strings in pandas, the inverse of pandas.Series.str.extract. This works fine with extract in the following code, where I extract the first word that begins with a double underscore:

import pandas as pd

regex = r"^(?:.*?)(__\S+)"

t = pd.Series(['no match', '__match at start', 'match __in_the middle', 'the end __matches', 'string __with two __captures'])
t.str.extract(regex)
           0
0        NaN
1    __match
2   __in_the
3  __matches
4     __with
dtype: object

But if I call pandas.Series.str.replace with an empty string, it removes the text matched by the non-capturing group as well:

t.str.replace(regex, '', regex=True)
0        no match
1        at start
2          middle
3
4  two __captures
dtype: object

How do I get the following output?

0                no match
1                at start
2           match  middle
3                the end
4  string  two __captures
dtype: object

CodePudding user response:

You can use

t.str.replace(r'__\S+', '', regex=True)

Output:

0         no match
1         at start
2    match  middle
3         the end 
4     string  two 
dtype: object

Basically, all you need to do is drop the non-capturing group and keep only the pattern from the capture group. __\S+ matches __ followed by one or more non-whitespace characters, and Series.str.replace removes the matched text. Note that without a limit on the number of replacements, every match in a string is removed, not just the first.
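To see why the original pattern also removed the leading text, it may help to compare what the regex engine reports as the whole match with what it reports as the capture group. The sketch below uses only Python's re module. The helper name remove_first_group is hypothetical, and the backreference variant is an alternative shown for illustration, not the method proposed in the answer above.

```python
import re

# The pattern from the question: non-capturing prefix, then the capture group
original = re.compile(r"^(?:.*?)(__\S+)")

s = "match __in_the middle"
m = original.search(s)
whole_match = m.group(0)  # everything the pattern consumed; this is what replace substitutes
captured = m.group(1)     # only the capture group; this is what extract returns

# Alternative sketch: capture the prefix instead and write it back with \1
keep_prefix = re.compile(r"^(.*?)__\S+")

def remove_first_group(text):
    """Hypothetical helper: drop the first __word and keep the rest of the string."""
    return keep_prefix.sub(r"\1", text, count=1)

print(repr(whole_match), repr(captured))
print(repr(remove_first_group(s)))
```

Because the pattern is anchored at the start of the string, the backreference variant can only ever touch the first occurrence, so later __words are left in place.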

CodePudding user response:

According to the pandas docs, you can specify the number of replacements to make, counted from the start of the string (the n parameter). If it works like the count argument of Python's re.sub, you can use @Wiktor Stribiżew's code with one modification:

t.str.replace(r'__\S+', '', n=1, regex=True)

If this is not how it works in pandas, then fall back to re.sub on each string:

import re
for s in t:
    x = re.sub(r"__\S+", "", s, count=1)  # remove only the first match
    print(x)
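Putting both suggestions side by side, the following sketch checks that limiting the replacement to one per string gives the output requested in the question. It assumes a pandas version in which Series.str.replace accepts n together with regex=True; the variable names via_pandas and via_re are illustrative only.

```python
import re
import pandas as pd

t = pd.Series([
    "no match",
    "__match at start",
    "match __in_the middle",
    "the end __matches",
    "string __with two __captures",
])
pattern = r"__\S+"

# pandas: n=1 limits the replacement to the first match in each string
via_pandas = t.str.replace(pattern, "", n=1, regex=True)

# Fallback: re.sub applied element-wise, where count=1 has the same effect
via_re = t.apply(lambda s: re.sub(pattern, "", s, count=1))

print(via_pandas.tolist())
print(via_re.tolist())
```

Both approaches leave the whitespace around the removed word untouched, which is why the results contain double, leading, or trailing spaces, as in the desired output of the question.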