I want to remove the first instance of a regex capture group from a series of strings in pandas, the inverse of pandas.Series.str.extract. This works fine with extract in the following code, where I extract the first word that begins with a double underscore:
import pandas as pd
regex = r"^(?:.*?)(__\S+)"
t = pd.Series(['no match', '__match at start', 'match __in_the middle', 'the end __matches', 'string __with two __captures'])
t.str.extract(regex)
0
0 NaN
1 __match
2 __in_the
3 __matches
4 __with
dtype: object
But if I call pandas.Series.str.replace with an empty string, it removes the text matched by the non-capturing group as well:
t.str.replace(regex, '', regex=True)
0 no match
1 at start
2 middle
3
4 two __captures
dtype: object
How do I get the following output?
0 no match
1 at start
2 match middle
3 the end
4 string two __captures
dtype: object
CodePudding user response:
You can use
t.str.replace(r'__\S+', '', regex=True)
Output:
0 no match
1 at start
2 match middle
3 the end
4 string two
dtype: object
Basically, all you need to do is get rid of the non-capturing group and the pattern inside it. __\S+ matches __ followed by one or more non-whitespace characters, and Series.str.replace removes every matched text.
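To see what this pattern does on the question's data, here is a minimal sketch. It assumes a pandas version where the `regex` keyword is accepted; the variable name `all_removed` is only for illustration. Note that without a replacement limit, both `__` words in the last string are removed, and the whitespace around each removed word stays in place.

```python
import pandas as pd

t = pd.Series(['no match', '__match at start', 'match __in_the middle',
               'the end __matches', 'string __with two __captures'])

# No limit on the number of replacements: every "__word" is dropped.
all_removed = t.str.replace(r'__\S+', '', regex=True)

for original, cleaned in zip(t, all_removed):
    print(repr(original), '->', repr(cleaned))
```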
CodePudding user response:
According to the pandas docs, you can specify the number of replacements to make, counting from the start of the string. If it works anything like Python's re.sub, then you can use @Wiktor Stribiżew's code with one modification:
t.str.replace(r'__\S+', '', n=1, regex=True)
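If this is not how things work in pandas, then fall back to re.sub on each string:
import re
# count=1 removes only the first match in each string
t = pd.Series([re.sub(r"__\S+", "", s, count=1) for s in t], index=t.index)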
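As a quick check of the "first match only" idea, the sketch below compares two ways of writing it on the question's data: limiting the replacement count with `n=1`, and anchoring the pattern at the start while putting the prefix back with a backreference. The names `first_only`, `anchored`, and `tidy` are hypothetical, and the final whitespace cleanup is an extra step that is only an assumption about what the desired output implies.

```python
import pandas as pd

t = pd.Series(['no match', '__match at start', 'match __in_the middle',
               'the end __matches', 'string __with two __captures'])

# Option 1: replace at most one match per string.
first_only = t.str.replace(r'__\S+', '', n=1, regex=True)

# Option 2: capture the text before the first "__word" and put it back.
# Because of the ^ anchor, the pattern can match at most once per string.
anchored = t.str.replace(r'^(.*?)__\S+', r'\1', regex=True)

# Optional cleanup: collapse runs of whitespace and trim the ends.
tidy = first_only.str.replace(r'\s+', ' ', regex=True).str.strip()

print(tidy.tolist())
```

Both options leave the spaces that surrounded the removed word, which is why the cleanup step is shown separately.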