I have the following dataframe:
df = pd.DataFrame()
data = {'Description':['CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML', 'BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE WAVE (1788 UNIDADES)', 'ADES SOYA ORAN TETRA 200MLX10',
'ADES SOYA NATURAL TETRA', 'ADES COCO TETRA']}
df = pd.DataFrame(data)
print (df)
I am using the following code (dictionary) to create a new column based on a specific brand name (in this case Ades) -solution found in Pandas str.contains - Search for multiple values in a string and print the values in a new column)
brands =['Ades']
def matcher(x):
for i in brands:
if i.lower() in x.lower():
return i
else:
return np.nan
df['Brands'] = df['Description'].apply(matcher)
It creates the column and applies the dictionary, but is not right.
Results:
and as of now if it finds ADES in any combination (like UNIDADES) it says that is Ades. What I am trying to accomplish is only Ades, not any combination of the word. This is a simple combination, but I have more than 10 million records and different brands. How to set up the dictionary only to find that word not a combination?
Thanks.
CodePudding user response:
Create a regex pattern to extract:
# (?i) insensitive, \b word boundary
pat = fr"(?i)\b({'|'.join(brands)})\b"
df['Brand'] = df['Description'].str.extract(pat, expand=False)
print(df)
# Output
Description Brand
0 CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML NaN
1 BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE ... NaN
2 ADES SOYA ORAN TETRA 200MLX10 ADES
3 ADES SOYA NATURAL TETRA ADES
4 ADES COCO TETRA ADES