Home > database >  Create new column based on a dictionary not working
Create new column based on a dictionary not working

Time:07-11

I have the following dataframe:

df = pd.DataFrame()
data = {'Description':['CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML', 'BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE WAVE (1788 UNIDADES)', 'ADES SOYA ORAN TETRA 200MLX10',
'ADES SOYA NATURAL TETRA', 'ADES COCO TETRA']}
df = pd.DataFrame(data)
print (df)

I am using the following code (dictionary) to create a new column based on a specific brand name (in this case Ades) -solution found in Pandas str.contains - Search for multiple values in a string and print the values in a new column)

brands =['Ades']

def matcher(x):
    for i in brands:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan

df['Brands'] = df['Description'].apply(matcher)

It creates the column and applies the dictionary, but is not right.

Results:

and as of now if it finds ADES in any combination (like UNIDADES) it says that is Ades. What I am trying to accomplish is only Ades, not any combination of the word. This is a simple combination, but I have more than 10 million records and different brands. How to set up the dictionary only to find that word not a combination?

Thanks.

CodePudding user response:

Create a regex pattern to extract:

# (?i) insensitive, \b word boundary
pat = fr"(?i)\b({'|'.join(brands)})\b"

df['Brand'] = df['Description'].str.extract(pat, expand=False)
print(df)

# Output
                                         Description Brand
0        CERVEZA DORADA BOTELLA NR 24 UNIDADES 350ML   NaN
1  BEBIDA DE ALMENDRA COCO SILK 1/6/946 ML WHITE ...   NaN
2                      ADES SOYA ORAN TETRA 200MLX10  ADES
3                            ADES SOYA NATURAL TETRA  ADES
4                                    ADES COCO TETRA  ADES
  • Related