I have a dictionary of lists and I want to use it to label the sentences. What's the most efficient way to do this?
entertainment_dict = {
"Food": ["McDonald", "Five Guys", "KFC"],
"Music": ["Taylor Swift", "Jay Z", "One Direction"],
"TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}
import pandas as pd
data = {'text': ["Kevin Lee has bought a Taylor Swift's CD.",
                 "The best burger in McDonald is cheeze buger.",
                 "Kevin McDonald is planning to watch the Big Bang Theory."]}
df = pd.DataFrame(data)
                                                text
0          Kevin Lee has bought a Taylor Swift's CD.
1       The best burger in McDonald is cheeze buger.
2  Kevin McDonald is planning to watch the Big Ba...
Expected output:
                                                text labels
0          Kevin Lee has bought a Taylor Swift's CD.  Music
1       The best burger in McDonald is cheeze buger.   Food
2  Kevin McDonald is planning to watch the Big Ba...     TV
CodePudding user response:
As in your previous questions, you can craft a custom regex, with one named group per category, to use with str.extract:
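# Build one named capture group per category, e.g. (?P<Food>McDonald|Five Guys|KFC)
regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
# str.extract yields one column per named group; multiplying the notnull() mask
# by the category names keeps the name where a group matched and '' elsewhere
df['labels'] = ((df['text'].str.extract(regex).notnull()*entertainment_dict.keys())
                .apply(lambda r: ','.join([i for i in r if i]) , axis=1)
               )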
NB. The regex here would be '(?P<Food>McDonald|Five Guys|KFC)|(?P<Music>Taylor Swift|Jay Z|One Direction)|(?P<TV>Big Bang Theory|Queen of South|Ted Lasso)'
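Output (row 2 is labelled Food rather than the expected TV: str.extract keeps only the first match, and "McDonald" appears before "Big Bang Theory" in that sentence):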
                                                text labels
0          Kevin Lee has bought a Taylor Swift's CD.  Music
1       The best burger in McDonald is cheeze buger.   Food
2  Kevin McDonald is planning to watch the Big Ba...   Food
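If you would rather collect every category that matches a sentence instead of only the first, one possible variation is to test each category separately with str.contains. This is only a sketch, not part of the approach above: the helper name label_all is made up for illustration, and re.escape is added on the assumption that keywords should be matched literally.

```python
import re
import pandas as pd

entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"],
}

df = pd.DataFrame({"text": [
    "Kevin Lee has bought a Taylor Swift's CD.",
    "The best burger in McDonald is cheeze buger.",
    "Kevin McDonald is planning to watch the Big Bang Theory.",
]})

def label_all(texts, categories):
    """Hypothetical helper: join the names of all categories found in each text."""
    # One boolean column per category; re.escape keeps the keywords literal.
    hits = pd.DataFrame({
        name: texts.str.contains("|".join(map(re.escape, words)), regex=True)
        for name, words in categories.items()
    })
    # For each row, keep the column names whose value is True.
    return hits.apply(lambda row: ",".join(row.index[row.to_numpy()]), axis=1)

df["labels"] = label_all(df["text"], entertainment_dict)
print(df["labels"].tolist())
# ['Music', 'Food', 'Food,TV']
```

Row 2 now carries both Food and TV. Plain keyword matching still cannot tell the surname "McDonald" from the restaurant, so getting TV alone for that row would need extra rules or context beyond the dictionary.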