How do I prevent str.contains() from searching for a sub-string?

Time:01-11

I want pandas to search my data frame for the complete string and not a sub-string. Here is a minimal working example that shows the problem:

import pandas as pd

data = [['tom', 'wells fargo', 'retired'], ['nick', 'bank of america', 'partner'], ['juli', 'chase', 'director - oil well']]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
 
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(val, case=False)).any(axis="columns")]

This returns the two rows below. The correct code would have returned only the second of them (juli) and not the first one (tom):

   Name        Place             Position
0   tom  wells fargo              retired
2  juli        chase  director - oil well

Update: My intention is to have a search that looks for the exact string requested. While looking for "well", the algorithm shouldn't pick up "wells". Based on the comments, I understand how my question might be misleading.

CodePudding user response:

IIUC, you can use:

>>> df[df['Position'].str.contains(fr'\b{val}\b')]

   Name  Place             Position
2  juli  chase  director - oil well

And for all columns:

>>> df[df.apply(lambda x: x.str.contains(fr'\b{val}\b', case=False)).any(axis=1)]

   Name  Place             Position
2  juli  chase  director - oil well

CodePudding user response:

The regular expression anchor \b, which matches a word boundary, is what you want.
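To see what \b does without pandas, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration (they echo the data in the question) and are not part of the original answer.

```python
import re

# Word-boundary pattern for the search term from the question.
pattern = re.compile(r"\bwell\b", flags=re.IGNORECASE)

# Illustrative strings, not taken verbatim from any data frame.
samples = [
    "wells fargo",          # "well" is followed by "s": no boundary after it
    "bank of somewell",     # "well" is preceded by "e": no boundary before it
    "director - oil well",  # standalone word
    "Well, that's it",      # standalone word, different case
]

# Keep only the strings in which "well" appears as a whole word.
matches = [s for s in samples if pattern.search(s)]
print(matches)
```

Only the last two strings are kept, which is the behaviour the question asks for.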

I added additional data to your code to illustrate this further:

import pandas as pd
data = [
    ['tom', 'wells fargo', 'retired'],
    ['nick', 'bank of america', 'partner'],
    ['john', 'bank of welly', 'blah'],
    ['jan', 'bank of somewell knwon', 'well that\'s it'],
    ['juli', 'chase', 'director - oil well'],
]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
 
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(fr"\b{val}\b", case=False)).any(axis="columns")]

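EDIT: In Python 3.6+, putting f in front of the opening " or ' lets you substitute the variable into the string, and r makes it a raw string, so \b reaches the regex engine as a word boundary instead of being read by Python as a backspace character. That way val can be whatever you want. Thanks @smci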
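The difference the r prefix makes can be checked directly. The sketch below uses only the standard library. The variable names and the "a.b" search term are made up for illustration, and the re.escape step is an extra precaution I am assuming is wanted when val can contain regex metacharacters; it is not part of the original answer.

```python
import re

val = "well"

# Without r, "\b" is a single backspace character (ASCII 8).
plain = f"\b{val}\b"
# With r, the backslash is kept, so the regex engine sees a word boundary.
raw = fr"\b{val}\b"

plain_hit = re.search(plain, "oil well") is not None
raw_hit = re.search(raw, "oil well") is not None
print(plain_hit, raw_hit)

# Assumption: the search term may contain metacharacters such as ".".
term = "a.b"
unescaped_hit = re.search(fr"\b{term}\b", "axb") is not None
escaped_hit = re.search(fr"\b{re.escape(term)}\b", "axb") is not None
print(unescaped_hit, escaped_hit)
```

The non-raw pattern looks for literal backspace characters and so never matches ordinary text, while the unescaped "a.b" matches "axb" because "." stands for any character.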

The output for the extended data frame is:

   Name                   Place             Position
3   jan  bank of somewell knwon       well that's it
4  juli                   chase  director - oil well
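
If the data frame can contain missing values, the all-columns search may need na=False so that empty cells count as non-matches. The sketch below wraps the same idea in a helper. The name rows_with_word and the sample frame with a missing Place value are hypothetical, and it assumes every column holds text.

```python
import re
import pandas as pd

def rows_with_word(df, word):
    """Hypothetical helper: rows where any column contains `word` as a whole word."""
    # Escape the term so characters like "." are matched literally.
    pattern = fr"\b{re.escape(word)}\b"
    # na=False treats missing cells as "no match" instead of propagating NaN.
    mask = df.apply(lambda col: col.str.contains(pattern, case=False, na=False))
    return df.loc[mask.any(axis="columns")]

# Made-up sample based on the question's data, with one missing value.
data = [
    ['tom', 'wells fargo', 'retired'],
    ['nick', None, 'partner'],
    ['juli', 'chase', 'director - oil well'],
]
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])

result = rows_with_word(df, 'well')
print(result)
```

Only the juli row is returned here. Columns with a numeric dtype would need converting or excluding first, since the .str accessor only works on text columns.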