Retrieve all occurrences of selected attributes into a separate column in pandas

I want to extract colors from product descriptions. I tried to use NER, but it was not successful. Now I am trying to define a list of colors and match it against each description.

I have data in a dataframe column like this:

Description: Tampered black round grey/natural swing with yellow load-bearing cap

I also defined a list of colors:

attributes =['red','blue','black','violet','grey','natural','beige','silver']

Then I created a matcher:

def matcher(x):
   for i in attributes:
       if i in x.lower():
           return i
   else:
       return np.nan

And I applied it to the df:

df['Colours'] = df['Description pre-work'].apply(matcher)

The result is wrong too. I get:

matcher('Tampered black round grey/natural swing with yellow load-bearing cap')

red

How can I retrieve all the matches into a list and store them in a separate column in pandas? Expected output:

['black','grey','natural','yellow']

How can I avoid getting red as a match when there is no red in the description?

I thought I could use the findall function to retrieve the data the way I want, but that doesn't help me either...

I'm lost. Thanks for your help!

CodePudding user response：

Jezreel's first answer is very good! However, when using

df['Colours'] = df['Description pre-work'].str.findall('|'.join(attributes), flags=re.I)

it will also find red inside words such as "Tampered". I suggest a quick fix (not the most robust one): split the description into words and keep only exact matches.

def matcher(desc):
    colors = []
    # split the sentence into words and look for exact matches
    words = desc.lower().replace(';', ' ').replace('-', ' ').replace('/', ' ').split()
    for color in attributes:
        if color in words:
            colors.append(color)
    return colors
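
To make the difference concrete, here is a minimal sketch that runs both approaches on the sample description from the question, using only the standard library. The names `description`, `substring_hits`, and `word_hits` are illustrative and not part of the original code.

```python
import re

attributes = ['red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver']
description = 'Tampered black round grey/natural swing with yellow load-bearing cap'

# Plain alternation matches substrings, so "red" is found inside "Tampered".
substring_hits = re.findall('|'.join(attributes), description, flags=re.I)

# Splitting into words first only allows whole-word matches.
words = description.lower().replace(';', ' ').replace('-', ' ').replace('/', ' ').split()
word_hits = [color for color in attributes if color in words]

print(substring_hits)  # ['red', 'black', 'grey', 'natural']
print(word_hits)       # ['black', 'grey', 'natural']
```

Note that yellow is missing from both results because it is not in the attributes list from the question; it has to be added there to appear in the output.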

CodePudding user response：

I suggest you first split your text into words to avoid returning "red" for "Tampered". Then take the intersection of your color set with the resulting words.

import re

# use a set rather than a list, it will help getting the intersection easily
attributes = {'red','blue','black','violet','grey','natural','beige','silver', 'yellow'}

def get_colors(text: str) -> list[str]:
    # get the word split
    tokens = re.findall(r'\b\w+\b', text.lower())
    # example tokens: ['tampered', 'black', 'round', 'grey', 'natural', 'swing', 'with' ...]
    # now simply return the intersection of tokens with attributes, as a list
    return list(attributes.intersection(tokens))

# then apply your function
df['Colours'] = df['Description pre-work'].apply(get_colors)
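
One thing to be aware of: a set intersection does not keep the order in which the colors appear in the text. If the order matters, a small variation could filter the tokens instead. This is only a sketch, and `get_colors_ordered` is a hypothetical name, not part of the answer above.

```python
import re

attributes = {'red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver', 'yellow'}

def get_colors_ordered(text: str) -> list:
    # tokenize into lowercase words
    tokens = re.findall(r'\b\w+\b', text.lower())
    seen = set()
    colors = []
    for token in tokens:
        # keep each colour once, in order of first appearance
        if token in attributes and token not in seen:
            seen.add(token)
            colors.append(token)
    return colors

result = get_colors_ordered('Tampered black round grey/natural swing with yellow load-bearing cap')
print(result)  # ['black', 'grey', 'natural', 'yellow']
```

It can be applied to the column the same way, with `.apply(get_colors_ordered)`.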

CodePudding user response：

Use Series.str.findall with word boundaries to avoid matching red in Tampered:

import re
pat = '|'.join(r"\b{}\b".format(x) for x in attributes)
df['Colours'] = df['Description pre-work'].str.findall(pat, flags=re.I)
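
For reference, a self-contained sketch of the word-boundary approach on a small made-up dataframe. The second row is invented for illustration, yellow is added to the list so the expected output from the question can be reproduced, and `re.escape` is an extra precaution in case an attribute ever contains regex metacharacters. Lower-casing the column before matching is a variation on `flags=re.I` that makes the matches come back in a uniform case.

```python
import re
import pandas as pd

attributes = ['red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver', 'yellow']

# Sample data; the second description is made up for illustration.
df = pd.DataFrame({
    'Description pre-work': [
        'Tampered black round grey/natural swing with yellow load-bearing cap',
        'Red chair with SILVER legs',
    ]
})

# \b...\b restricts each alternative to whole words, so "red" no longer
# matches inside "tampered".
pat = '|'.join(r'\b{}\b'.format(re.escape(x)) for x in attributes)

# Lower-case first so every match is returned in lower case.
df['Colours'] = df['Description pre-work'].str.lower().str.findall(pat)

print(df['Colours'].tolist())
# [['black', 'grey', 'natural', 'yellow'], ['red', 'silver']]
```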