Home > Mobile >  How to exclude prefixed substrings from column values
How to exclude prefixed substrings from column values

Time:02-15

My goal is to exclude the rows prefixed by No, for example I'm interested by the word commerce, then I have to keep all rows contains commerce and exclude no commerce. I'm looking for an generic solution

data = {
'description':[
    'Commerce foo', 
    'foo commercial',
    'foo no commercial',
    'foo commercial boo',
    'No commerce',
    'foo restaurant',
    'boo no restaurant']
}

df = pd.DataFrame(data)

    ______________________
    | description        |
__________________________
| 0 | Commerce foo       |
| 1 | foo commercial     |
| 2 | foo no commercial  |
| 3 | foo commercial boo |
| 4 | No commerce        |
| 5 | foo restaurant     |
| 6 | boo no restaurant  |
__________________________

I tried (not working)

search = ['restaurant', 'commerce', 'commercial']

df['description'].str.findall(f'̂̂̂^(?<=no\s)({"|".join(search)})', flags=re.IGNORECASE)

What I'm looking for

    ______________________
    | description        |
__________________________
| 0 | Commerce foo       |
| 1 | foo commercial     |
| 3 | foo commercial boo |
| 5 | foo restaurant     |
__________________________

CodePudding user response:

The pattern currently starts with ^(?<=no\s) which asserts that from the start of the string there should be no followed by a whitespace char directly to the left. That will not match.

If you want to filter the rows, you can use str.contains instead of findall.

You can remove the anchor ^ and change the assertion to (?<!no\s) what will assert that there is not no followed by a whitespace char directly to the left and add word boundaries to the left and right to prevent partial word matches.

For a match only, you can change the capture group to a non capturing one (?:

If you print the pattern, it will look like

\b(?<!no\s)(?:restaurant|commerce|commercial)\b

See a regex demo

data = {
    'description':[
        'Commerce foo',
        'foo commercial',
        'foo no commercial',
        'foo commercial boo',
        'No commerce',
        'foo restaurant',
        'boo no restaurant']
}

df = pd.DataFrame(data)
search = ['restaurant', 'commerce', 'commercial']
print(df[df['description'].str.contains(fr'\b(?<!no\s)(?:{"|".join(search)})\b', flags=re.IGNORECASE)])

Output

          description
0        Commerce foo
1      foo commercial
3  foo commercial boo
5      foo restaurant
  • Related