I have been trying for a good while now and cannot find an answer online, so I'm hoping someone here can help.
I have a dataframe with a column that contains descriptive text, e.g.
"BALANCE SHRINKER - CORE"
Each row has a different text value.
I need to check for the existence of any of multiple words:
['LOB','LIFE','SHRINKER']
say.
And from the result (True/False), create a new column set to 999 if any phrase is found in the text column being searched, or set to 0 otherwise.
I have tried this kind of approach but nothing works for me:
df['rule1'] = 999 if any(x in df['textcolumn'].str for x in ['LOB','LIFE','SHRINKER']) else 0
I've tried .find()
and .contains()
but to no avail.
So, I'm sure someone can advise!
Thanks for looking.
DT
Answer:
Use Series.str.contains
to check if each row of 'textcolumn' contains any of the words, producing a boolean Series. Then use Series.map
to map the True values to 999 and the False values to 0.
# list of words to find in 'textcolumn'
words = ['LOB','LIFE','SHRINKER']
# regex pattern to search in 'textcolumn'
# '|' stands for OR. Read pat as "match 'LOB' OR 'LIFE' OR 'SHRINKER'"
pat = "|".join(words)
df['rule1'] = df['textcolumn'].str.contains(pat).map({True: 999, False: 0})
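As a quick sanity check, here is a minimal self-contained sketch of the same approach. The sample rows are made up for illustration and are not from the question; only 'textcolumn', 'rule1' and the word list come from the original post.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "textcolumn": [
        "BALANCE SHRINKER - CORE",
        "TERM LIFE POLICY",
        "GENERAL ACCOUNT",
    ]
})

words = ["LOB", "LIFE", "SHRINKER"]
pat = "|".join(words)  # "LOB|LIFE|SHRINKER"

# Boolean Series from str.contains, then mapped to 999 / 0
df["rule1"] = df["textcolumn"].str.contains(pat).map({True: 999, False: 0})
print(df)
```

The first two rows should get 999 and the last row 0, since "GENERAL ACCOUNT" contains none of the words.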
Another option is to use numpy.where
import numpy as np
words = ['LOB','LIFE','SHRINKER']
pat = "|".join(words)
df['rule1'] = np.where(df['textcolumn'].str.contains(pat), 999, 0)
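A few edge cases may be worth guarding against, depending on your data: missing values in the column, words that contain regex metacharacters, mixed case, and substring hits (for example 'LOB' matching inside 'GLOBAL'). The sketch below is one possible way to handle these. The sample rows are invented, and whether you want whole-word or case-insensitive matching is an assumption, not something stated in the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data, including a missing value
df = pd.DataFrame({
    "textcolumn": ["BALANCE SHRINKER - CORE", "GLOBAL FUND", None, "life cover"]
})

words = ["LOB", "LIFE", "SHRINKER"]

# re.escape protects against regex metacharacters in the words;
# \b ... \b restricts matches to whole words (assumed requirement)
pat = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

# na=False treats missing text as "no match"; case=False ignores case
mask = df["textcolumn"].str.contains(pat, case=False, na=False)
df["rule1"] = np.where(mask, 999, 0)
print(df)
```

Drop the `\b` anchors or `case=False` if plain substring or case-sensitive matching is what you actually need.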