How to replace everything before and including substring in a dataframe found in a list-CodePudding

I have a dataframe with upwards of 10,000 entries of retail items. For each of them the item name follows a standard naming convention: quantity, unit, brand name, item e.g: 100 rolls Cushelle Toilet Paper

I also have a list of the brand names: [Cushelle, Charmin, Tesco, Sainsburys]

I want to go through the list of items and remove everything before and including the brand name, so I have a column in my dataframe of items. how should i go about it?

data['item_name']
1 Toilet Paper
2 Bleach
3 Antibacterial Spray

I have tried the following:

variation = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
data['item_name'] = data['item'].str.replace(^.*?'(?=variations), '')

this errored, but when trying it with a singular item it worked exactly like how i needed, e.g.:

data['item_name'] = data['item'].str.replace(^.*?(?=Charmin)', '')

and

data['item_name'] = data['item'].str.replace('|'.join(variations), '')

this only removed the brand name and kept in the quantity and unit which i wanted to remove.

CodePudding user response：

Here you go:

variations = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
variations =  '|'.join(variations)
df['item_name'] = df['item'].str.split(variations, n=1).str.get(-1)

CodePudding user response：

I would use str.extract:

import re

variation = ['Cushelle', 'Charmin', 'Tesco', 'Sainsburys']
pattern = "|".join(map(re.escape, variation))

data['item_name'] = data['item'].str.extract(fr'(?:{pattern})\s*(.*)')

Example:

                              item     item_name
0  100 rolls Cushelle Toilet Paper  Toilet Paper
1    50 rolls Charmin Toilet Paper  Toilet Paper

CodePudding user response：

If the quantitiy, unit, and brand name are always one word each, you can exploit that by splitting the items into word lists, leaving out the first three elements of each list, and joining the rest together again:

data['item_name'] = data.item.str.split(' ').map(lambda x: x[3:]).str.join(' ')