Below is my DF
df = pd.DataFrame({'a' : ['NYC', 'NYC', 'Boston', 'LA', 'SF', 'NYC'], 'b' : ['Other', 'Other', 'NY', 'NUI', 'SD', 'SF']})
a b
0 NYC Other
1 NYC Other
2 Boston NY
3 LA NUI
4 SF SD
5 NYC SF
The aim is to check if list of words is in the df
Below is the code to check for a specific word
word = 'SF'
mask = np.column_stack([df[col].str.contains(word, na=False) for col in df])
df.loc[mask.any(axis=1)]
a b
4 SF SD
5 NYC SF
How can this be performed with list and not one string ?
word = ['SF', 'NY']
CodePudding user response:
You could form a regex alternation from the terms and use that:
word = ['SF', 'NY']
regex = r'^(?:' r'|'.join(word) ')$'
mask = np.column_stack([df[col].str.contains(regex, na=False) for col in df])
df.loc[mask.any(axis=1)]
CodePudding user response:
If you want to match exact words, use isin
combined with any
:
word = ['SF', 'NY']
df[df.isin(word).any(1)]
output:
a b
2 Boston NY
4 SF SD
5 NYC SF
intermediates:
df.isin(word)
a b
0 False False
1 False False
2 False True
3 False False
4 True False
5 False True
df.isin(word).any(1)
0 False
1 False
2 True
3 False
4 True
5 True
dtype: bool
For a regex match combine apply
and str.contains
:
word = ['SF', 'NY']
regex = '|'.join(word)
df[df.apply(lambda c: c.str.contains(regex)).any(1)]
output:
a b
0 NYC Other
1 NYC Other
2 Boston NY
4 SF SD
5 NYC SF