I want to add a column to a DataFrame that holds one keyword from a list of keywords, chosen by which keyword appears in another (description) column. If a description contains several keywords, I need to pick the first match. I then need to sum the values for each unique keyword. Here is my attempt, but I can't solve the multiple-matches issue. Can someone please help?
import pandas as pd
kw = ['alpha', 'beta', 'theta', 'delta']
data = {'description':['This text contains alpha', 'Here are delta & beta', 'It is beta', 'Another Alpha', 'sometimes Theta too', 'One more BETA'],
'value': [100,200,300,400,500,600]}
df = pd.DataFrame(data)
#add column based on which keyword appears in description
df['keys'] = df['description'].str.lower().str.findall('|'.join(kw)).apply(set).str.join(',') #Is there a simpler way to code this?
print(f"new df = \n{df}\n")
#add values of unique keywords (select 'value' so the text column is not summed as well)
df2 = df.groupby('keys')[['value']].sum()
print(f"with key values = \n{df2}\n")
Output:
new df =
                description  value        keys
0  This text contains alpha    100       alpha
1     Here are delta & beta    200  delta,beta
2                It is beta    300        beta
3             Another Alpha    400       alpha
4       sometimes Theta too    500       theta
5             One more BETA    600        beta

with key values =
            value
keys
alpha         500
beta          900
delta,beta    200
theta         500
CodePudding user response:
You can use this:
df['keys'] = df['description'].str.lower().str.findall('|'.join(kw)).str[0]
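Applying set keeps every distinct keyword and discards the order of the matches (a set is unordered), which is why row 1 ends up as 'delta,beta'. str.findall returns the matches in the order they occur in the text, so .str[0] takes the first element of each list and keeps only the Greek letter that appears first in the description. A row with no match gets NaN.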
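For completeness, here is a minimal end-to-end sketch of the approach above, using the data from the question. The `re.escape` call and the `pattern` and `totals` names are my own additions for illustration, and summing only the `value` column is an assumption about what the asker wants in the final result.

```python
import re

import pandas as pd

kw = ['alpha', 'beta', 'theta', 'delta']
data = {
    'description': ['This text contains alpha', 'Here are delta & beta', 'It is beta',
                    'Another Alpha', 'sometimes Theta too', 'One more BETA'],
    'value': [100, 200, 300, 400, 500, 600],
}
df = pd.DataFrame(data)

# Build one alternation pattern. re.escape guards against keywords that
# contain regex metacharacters (not needed for these four, but harmless).
pattern = '|'.join(re.escape(k) for k in kw)

# findall returns matches in order of appearance; .str[0] keeps the first
# one and yields NaN for rows with no match.
df['keys'] = df['description'].str.lower().str.findall(pattern).str[0]

# Sum only the numeric column so the description strings are left out.
totals = df.groupby('keys')['value'].sum()
print(totals)
```

With this data, row 1 ('Here are delta & beta') resolves to delta because it appears first in the text, so the totals come out as alpha 500, beta 900, delta 200 and theta 500. Note that groupby drops rows whose key is NaN by default, so unmatched descriptions would not show up in the totals.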