I have a large DataFrame (named: complete) of data (only two columns). I want to filter it using complete words only, instead of substrings. Ex:
complete dataframe:
comment | sentiment |
---|---|
fast running | 0.9 |
heavily raining | 0.5 |
in the house | 0.1 |
coming in | 0.0 |
rubbing it | -0.5 |
if I set a substring to filter my table:
substring = 'in'
comp = complete[complete.apply(lambda row: row.astype(str).str.contains(substring, case=False).any(), axis=1)]
output comp:
comment | sentiment |
---|---|
fast running | 0.9 |
heavily raining | 0.5 |
in the house | 0.1 |
coming in | 0.0 |
rubbing it | -0.5 |
It returns the same DF since all words do have "in" as a substring.
My desired output:
comment | sentiment |
---|---|
in the house | 0.1 |
coming in | 0.0 |
Filter it only if the substring is found as a word, not as a substring.
¿How can this be done?
CodePudding user response:
Use word boundaries (\b
) in your pattern:
substring = 'in'
comp = complete[complete['comment'].str.contains(fr'\b{substring}\b', case=False)]
If you have many columns to check, aggregate after (this is much more efficient than apply
with axis=1
):
comp = complete[complete[list_of_cols].apply(lambda s: s.str.contains(fr'\b{substring}\b', case=False)).any(axis=1)]
CodePudding user response:
Another possible solution, which uses lookaround
and \w
means word character and !\w
means no word character. See demonstration
to understand the logic.
sub = 'in'
df.loc[df.comment.str.contains(fr'(?<!\w){sub}(?!\w)'), :]
Output:
comment sentiment
2 in the house 0.1
3 coming in 0.0