My goal is to exclude the rows prefixed by No, for example I'm interested by the word commerce, then I have to keep all rows contains commerce and exclude no commerce. I'm looking for an generic solution
data = {
'description':[
'Commerce foo',
'foo commercial',
'foo no commercial',
'foo commercial boo',
'No commerce',
'foo restaurant',
'boo no restaurant']
}
df = pd.DataFrame(data)
______________________
| description |
__________________________
| 0 | Commerce foo |
| 1 | foo commercial |
| 2 | foo no commercial |
| 3 | foo commercial boo |
| 4 | No commerce |
| 5 | foo restaurant |
| 6 | boo no restaurant |
__________________________
I tried (not working)
search = ['restaurant', 'commerce', 'commercial']
df['description'].str.findall(f'̂̂̂^(?<=no\s)({"|".join(search)})', flags=re.IGNORECASE)
What I'm looking for
______________________
| description |
__________________________
| 0 | Commerce foo |
| 1 | foo commercial |
| 3 | foo commercial boo |
| 5 | foo restaurant |
__________________________
CodePudding user response:
The pattern currently starts with ^(?<=no\s)
which asserts that from the start of the string there should be no
followed by a whitespace char directly to the left. That will not match.
If you want to filter the rows, you can use str.contains instead of findall.
You can remove the anchor ^
and change the assertion to (?<!no\s)
what will assert that there is not no
followed by a whitespace char directly to the left and add word boundaries to the left and right to prevent partial word matches.
For a match only, you can change the capture group to a non capturing one (?:
If you print the pattern, it will look like
\b(?<!no\s)(?:restaurant|commerce|commercial)\b
See a regex demo
data = {
'description':[
'Commerce foo',
'foo commercial',
'foo no commercial',
'foo commercial boo',
'No commerce',
'foo restaurant',
'boo no restaurant']
}
df = pd.DataFrame(data)
search = ['restaurant', 'commerce', 'commercial']
print(df[df['description'].str.contains(fr'\b(?<!no\s)(?:{"|".join(search)})\b', flags=re.IGNORECASE)])
Output
description
0 Commerce foo
1 foo commercial
3 foo commercial boo
5 foo restaurant