I have DF that looks like below:
columna_1 column_2
1 I am Thomas
2 Are you Thomas
3 How are you?
4 I am fine...
5 Jack, what's up?
and I have a list like this:
names = ["Thomas", "Jack"]
What I need is to find values from list names if there are in column_2
and if there are, it should add new column with the values from list. So it should be like this:
columna_1 column_2 column_3
1 I am Thomas Thomas
2 Are you Thomas Thomas
5 Jack, what's up? Jack
I have started with finding just values but I am not sure how I could add a new column with correct values. Below is my code:
df[df['column_2'].astype("str").apply(lambda x: set(names).issubset(x))]
Do you have any idea? Thanks for help!
CodePudding user response:
Use a regex:
import re
regex = fr'\b({"|".join(map(re.escape, names))})\b'
df['column_3'] = df['column_2'].str.extract(regex, expand=False)
To also drop the non matches:
import re
regex = fr'\b({"|".join(map(re.escape,names))})\b'
(df.assign(column_3=df['column_2'].str.extract(regex, expand=False))
.dropna(subset=['column_3'])
)
output:
columna_1 column_2 column_3
0 1 I am Thomas Thomas
1 2 Are you Thomas Thomas
4 5 Jack, what's up? Jack
NB. Note that in case both Thomas and Jack are in the sentence, only the first one in the sentence will match