I have a dataframe with upwards of 10,000 entries of retail items. For each of them the item name follows a standard naming convention: quantity, unit, brand name, item e.g: 100 rolls Cushelle Toilet Paper
I also have a list of the brand names: [Cushelle, Charmin, Tesco, Sainsburys]
I want to go through the list of items and remove everything before and including the brand name, so I have a column in my dataframe of items. how should i go about it?
data['item_name']
1 Toilet Paper
2 Bleach
3 Antibacterial Spray
I have tried the following:
variation = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
data['item_name'] = data['item'].str.replace(^.*?'(?=variations), '')
this errored, but when trying it with a singular item it worked exactly like how i needed, e.g.:
data['item_name'] = data['item'].str.replace(^.*?(?=Charmin)', '')
and
data['item_name'] = data['item'].str.replace('|'.join(variations), '')
this only removed the brand name and kept in the quantity and unit which i wanted to remove.
CodePudding user response:
Here you go:
variations = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
variations = '|'.join(variations)
df['item_name'] = df['item'].str.split(variations, n=1).str.get(-1)
CodePudding user response:
I would use str.extract
:
import re
variation = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
pattern = "|".join(map(re.escape, variation))
data['item_name'] = data['item'].str.extract(fr'(?:{pattern})\s*(.*)')
Example:
item item_name
0 100 rolls Cushelle Toilet Paper Toilet Paper
1 50 rolls Charmin Toilet Paper Toilet Paper
CodePudding user response:
If the quantitiy, unit, and brand name are always one word each, you can exploit that by splitting the items into word lists, leaving out the first three elements of each list, and joining the rest together again:
data['item_name'] = data.item.str.split(' ').map(lambda x: x[3:]).str.join(' ')