Given this DataFrame:
import pandas as pd
data = {'Description': ['with milk and orange', 'champagne', 'BANANA', 'bananas and apple', 'fafsa Lemons', 'GIN LEMON'],
'Amount': ['10', '20', '10', '5', '9', '15']}
df = pd.DataFrame(data)
print(df)
and the following list:
Fruits = ['apple','banana','lemon', 'orange']
How can I obtain the column 'Fruit'? That is, search the 'Description' column for every element of Fruits, ignoring case, and put the matches for each row in a new 'Fruit' column, joined with '-'. Rows with no match should get an empty string. Expected result:
datanew = {'Description': ['with milk and orange', 'champagne', 'BANANA', 'bananas and apple', 'fafsa Lemons', 'GIN LEMON'],
'Amount': ['10', '20', '10', '5', '9', '15'],
'Fruit': ['orange', '', 'banana', 'banana-apple', 'lemon', 'lemon'],
}
df2 = pd.DataFrame(datanew)
print(df2)
CodePudding user response:
You can use str.extractall and groupby.agg:
import re
df['Fruit'] = (df['Description']
.str.extractall(f"({'|'.join(Fruits)})", flags=re.I)
.groupby(level=0).agg('-'.join)[0]
.str.lower()
)
output:
            Description Amount         Fruit
0  with milk and orange     10        orange
1             champagne     20           NaN
2                BANANA     10        banana
3     bananas and apple      5  banana-apple
4          fafsa Lemons      9         lemon
5             GIN LEMON     15         lemon
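Note that the question asks for an empty string rather than NaN in rows without a match. A minimal sketch of one way to get that, assuming the same df and Fruits as in the question: reindex the aggregated result against the original index and fill the gaps with fillna(''). The use of re.escape is an added precaution (not part of the answer above) in case a list entry ever contains regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Description': ['with milk and orange', 'champagne', 'BANANA',
                    'bananas and apple', 'fafsa Lemons', 'GIN LEMON'],
    'Amount': ['10', '20', '10', '5', '9', '15'],
})
Fruits = ['apple', 'banana', 'lemon', 'orange']

# One capture group containing all fruit names as alternatives.
# re.escape is an assumption-level safety measure for special characters.
pattern = f"({'|'.join(map(re.escape, Fruits))})"

df['Fruit'] = (
    df['Description']
    .str.extractall(pattern, flags=re.I)[0]  # one row per match, case-insensitive
    .groupby(level=0).agg('-'.join)          # join all matches of the same original row
    .str.lower()
    .reindex(df.index)                       # bring back rows that had no match
    .fillna('')                              # empty string instead of NaN
)
print(df)
```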
CodePudding user response:
Here is another way, using str.findall():
(df.assign(Fruit = df['Description'].str.lower()
.str.findall('|'.join(Fruits))
.str.join('-')))
            Description Amount         Fruit
0  with milk and orange     10        orange
1             champagne     20
2                BANANA     10        banana
3     bananas and apple      5  banana-apple
4          fafsa Lemons      9         lemon
5             GIN LEMON     15         lemon
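One thing to watch with the findall approach: if a description mentions the same fruit twice, it appears twice in the result (for example 'lemon-lemon'). The question does not say whether that is wanted. If it is not, a rough sketch of a variant that drops repeats while keeping first-seen order could look like this; find_fruits is a hypothetical helper name introduced here, not something from the answers above.

```python
import re
import pandas as pd

def find_fruits(text, fruits):
    """Return the fruits found in text, lowercased, de-duplicated, joined by '-'."""
    pattern = '|'.join(map(re.escape, fruits))
    matches = re.findall(pattern, text.lower())
    # dict.fromkeys keeps the first occurrence of each match, in order
    return '-'.join(dict.fromkeys(matches))

Fruits = ['apple', 'banana', 'lemon', 'orange']
df = pd.DataFrame({
    'Description': ['with milk and orange', 'champagne', 'BANANA',
                    'bananas and apple', 'fafsa Lemons', 'GIN LEMON'],
    'Amount': ['10', '20', '10', '5', '9', '15'],
})

# assign returns a new DataFrame; df itself is left unchanged
result = df.assign(Fruit=df['Description'].map(lambda s: find_fruits(s, Fruits)))
print(result)
```

Like the answer above, this returns an empty string for rows without a match, since joining an empty list gives ''.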