I'm using a dictionary to assign labels to sentences. Currently, it works well when a sentence has a single label, but some sentences can have more than one label. How can I handle this with a string-search approach?
import pandas as pd
entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}
data = {'text': ["Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
                 "The best burger in McDonald is cheeze buger.",
                 "Kevin Lee is planning to watch the Big Bang Theory and eat at KFC."]}
df = pd.DataFrame(data)
regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extract(regex).notnull()*entertainment_dict.keys())
                .apply(lambda r: ','.join([i for i in r if i]), axis=1)
               )
                                                text labels
0  Kevin Lee has bought a Taylor Swift's CD and e...  Music
1       The best burger in McDonald is cheeze buger.   Food
2  Kevin Lee is planning to watch the Big Bang Th...     TV
CodePudding user response:
Change str.extract to str.extractall and combine the matches for each row with groupby:
regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extractall(regex).notnull().groupby(level=0).max()*entertainment_dict.keys())
                .apply(lambda r: ','.join([i for i in r if i]), axis=1)
               )
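In summary, change
df['text'].str.extract(regex).notnull()
into
df['text'].str.extractall(regex).notnull().groupby(level=0).max()
str.extract keeps only the first match in each string, which is why each sentence got a single label. str.extractall returns one row per match, and groupby(level=0).max() collapses those rows back into one row per sentence.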
output:
                                                text      labels
0  Kevin Lee has bought a Taylor Swift's CD and e...  Food,Music
1       The best burger in McDonald is cheeze buger.        Food
2  Kevin Lee is planning to watch the Big Bang Th...     Food,TV
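
Two caveats worth keeping in mind: str.extractall leaves out rows that have no match at all, so those rows would end up as NaN after the assignment, and keywords containing regex metacharacters would need escaping. Below is a minimal sketch of a variant that handles both, assuming the same entertainment_dict. The fourth sentence and the names matched and labels are added here purely for illustration, and the labels are built from the column names rather than by multiplying by the dictionary keys.

```python
import re
import pandas as pd

entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"],
}

df = pd.DataFrame({"text": [
    "Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
    "The best burger in McDonald is cheeze buger.",
    "Kevin Lee is planning to watch the Big Bang Theory and eat at KFC.",
    "Nothing to label here.",  # illustrative row with no keyword at all
]})

# One named group per label; re.escape protects keywords that contain
# regex metacharacters (none do here, but real keyword lists often will).
regex = "|".join(
    "(?P<{}>{})".format(label, "|".join(re.escape(word) for word in words))
    for label, words in entertainment_dict.items()
)

# extractall yields one row per match, indexed by (original row, match number).
# Grouping on level 0 and taking any() marks each label found in the sentence.
matched = df["text"].str.extractall(regex).notna().groupby(level=0).any()

# Join the names of the columns that matched, in dictionary order.
labels = matched.apply(
    lambda row: ",".join(label for label, hit in row.items() if hit),
    axis=1,
)

# Rows without any match are absent from `labels`; fill them with "".
df["labels"] = labels.reindex(df.index, fill_value="")

print(df["labels"].tolist())
```

Note that the dictionary keys are used as regex group names, so they have to be valid Python identifiers; a label such as "TV Show" would need a different mapping.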