Home > other >  filter a DataFrame using complete word only
filter a DataFrame using complete word only

Time:12-07

I have a large DataFrame (named: complete) of data (only two columns). I want to filter it using complete words only, instead of substrings. Ex:

complete dataframe:

comment sentiment
fast running 0.9
heavily raining 0.5
in the house 0.1
coming in 0.0
rubbing it -0.5

if I set a substring to filter my table:

substring = 'in'
comp = complete[complete.apply(lambda row: row.astype(str).str.contains(substring, case=False).any(), axis=1)]

output comp:

comment sentiment
fast running 0.9
heavily raining 0.5
in the house 0.1
coming in 0.0
rubbing it -0.5

It returns the same DF since all words do have "in" as a substring.

My desired output:

comment sentiment
in the house 0.1
coming in 0.0

Filter it only if the substring is found as a word, not as a substring.

¿How can this be done?

CodePudding user response:

Use word boundaries (\b) in your pattern:

substring = 'in'
comp = complete[complete['comment'].str.contains(fr'\b{substring}\b', case=False)]

If you have many columns to check, aggregate after (this is much more efficient than apply with axis=1):

comp = complete[complete[list_of_cols].apply(lambda s: s.str.contains(fr'\b{substring}\b', case=False)).any(axis=1)]

CodePudding user response:

Another possible solution, which uses lookaround and \w means word character and !\w means no word character. See demonstration to understand the logic.

sub = 'in'
df.loc[df.comment.str.contains(fr'(?<!\w){sub}(?!\w)'), :]

Output:

        comment  sentiment
2  in the house        0.1
3     coming in        0.0
  • Related