I want to extract colors from product descriptions. I tried NER, but it wasn't successful. Now I am trying to define a list of colors and match it against the description.
I have data in a dataframe column like this:
Description: Tampered black round grey/natural swing with yellow load-bearing cap
I also defined the list of colors:
attributes = ['red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver']
Then I created a matcher:
def matcher(x):
    for i in attributes:
        if i in x.lower():
            return i
        else:
            return np.nan
And I applied it to the dataframe:
df['Colours'] = df['Description pre-work'].apply(matcher)
The result is wrong, too. For example, I get:
matcher('Tampered black round grey/natural swing with yellow load-bearing cap')
red
How can I retrieve all the matches as a list and store them in a separate column in pandas? Expected output:
['black','grey','natural','yellow']
How can I avoid getting red as a match when there is no red in the description?
I thought I could use the findall function to retrieve the data the way I want, but that doesn't help me either...
I'm lost. Thanks for the help!
CodePudding user response:
Jezreel's first answer is very good! However, when using
df['Colours'] = df['Description pre-work'].str.findall('|'.join(attributes), flags=re.I)
it will always find red inside words such as "Tampered". I suggest a quick fix (not the most robust one): split the description into words and look for exact matches.
def matcher(desc):
    colors = []
    # split the sentence into words and look for an exact match
    words = desc.lower().replace(';', ' ').replace('-', ' ').replace('/', ' ').split(" ")
    for color in attributes:
        if color in words:
            colors.append(color)
    return colors
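To make the difference concrete, here is a minimal sketch that runs both approaches on the example description from the question. The variable names `substring_hits` and `word_hits` are only for illustration, and the color list is the one from the question (so it has no 'yellow').

```python
import re

attributes = ['red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver']
text = 'Tampered black round grey/natural swing with yellow load-bearing cap'

# Substring search: "red" is found inside "Tampered".
substring_hits = re.findall('|'.join(attributes), text, flags=re.I)

# Whole-word comparison: replace separators with spaces, then split into words.
words = text.lower().replace(';', ' ').replace('-', ' ').replace('/', ' ').split()
word_hits = [color for color in attributes if color in words]

print(substring_hits)  # ['red', 'black', 'grey', 'natural']
print(word_hits)       # ['black', 'grey', 'natural']
```

Note that 'yellow' is missing from both results simply because it is not in the list, so it would have to be added to get the expected output.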
CodePudding user response:
I suggest you first split your text into words to avoid returning "red" for "Tampered". Then take the intersection of your color list with the split text.
import re

# use a set rather than a list; it makes the intersection easy to compute
attributes = {'red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver', 'yellow'}

def get_colors(text: str) -> list[str]:
    # split the lowercased text into words
    tokens = re.findall(r'\b\w+\b', text.lower())
    # example tokens: ['tampered', 'black', 'round', 'grey', 'natural', 'swing', 'with' ...]
    # return the intersection of tokens with attributes, as a list (order is not guaranteed)
    return list(attributes.intersection(tokens))

# then apply your function
df['Colours'] = df['Description pre-work'].apply(get_colors)
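One thing to keep in mind: a set intersection does not preserve the order in which the colors appear in the description. If that order matters, a possible variant is sketched below. The name `get_colors_ordered` is made up for this example and it needs Python 3.9+ for the `list[str]` annotation.

```python
import re

attributes = {'red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver', 'yellow'}

def get_colors_ordered(text: str) -> list[str]:
    # split the lowercased text into words
    tokens = re.findall(r'\b\w+\b', text.lower())
    # dict.fromkeys drops duplicate tokens while keeping first-seen order
    return [token for token in dict.fromkeys(tokens) if token in attributes]

print(get_colors_ordered('Tampered black round grey/natural swing with yellow load-bearing cap'))
# ['black', 'grey', 'natural', 'yellow']
```

It can be applied to the column the same way, with .apply(get_colors_ordered).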
CodePudding user response:
Use Series.str.findall with word boundaries to avoid matching red in tampered:
import re
pat = '|'.join(r"\b{}\b".format(x) for x in attributes)
df['Colours'] = df['Description pre-work'].str.findall(pat, flags=re.I)
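The pattern can also be checked outside pandas with plain re, which is handy for testing. This is only a sketch under a few assumptions: 'yellow' has been added to the list, the helper name `find_colours` is made up, and the matches are lowercased because with re.I findall returns them in their original case.

```python
import re

attributes = ['red', 'blue', 'black', 'violet', 'grey', 'natural', 'beige', 'silver', 'yellow']

# One pair of word boundaries around a non-capturing group of escaped colors.
pat = re.compile(r'\b(?:' + '|'.join(re.escape(a) for a in attributes) + r')\b', flags=re.I)

def find_colours(text):
    # Lowercase the matches so 'BLACK' and 'black' end up as the same value.
    return [m.lower() for m in pat.findall(text)]

print(find_colours('Tampered BLACK round grey/natural swing with yellow load-bearing cap'))
# ['black', 'grey', 'natural', 'yellow']
```

The same pattern string (pat.pattern) should work with Series.str.findall and flags=re.I as in the line above.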