Assign additional labels to a string column in pandas

I'm using a dictionary to assign labels to sentences. Currently, it works well when a sentence has a single label, but there are cases where a sentence should get more than one label. How can I do this with a string-search approach?

import pandas as pd

entertainment_dict = {
  "Food": ["McDonald", "Five Guys", "KFC"],
  "Music": ["Taylor Swift", "Jay Z", "One Direction"],
  "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}

data = {'text':["Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.", 
                "The best burger in McDonald is cheeze buger.",
                "Kevin Lee is planning to watch the Big Bang Theory and eat at KFC."]}

df = pd.DataFrame(data)

regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extract(regex).notnull()*entertainment_dict.keys())
                 .apply(lambda r: ','.join([i for i in r if i]) , axis=1)
                )

Current output:

                                                text labels
0  Kevin Lee has bought a Taylor Swift's CD and e...  Music
1       The best burger in McDonald is cheeze buger.   Food
2  Kevin Lee is planning to watch the Big Bang Th...     TV

CodePudding user response:

Replace extract with extractall, which returns every match in each string rather than only the first, and combine the matches for each row with groupby:

regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extractall(regex).notnull().groupby(level=0).max()*entertainment_dict.keys())
                 .apply(lambda r: ','.join([i for i in r if i]) , axis=1)
                )

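In short, replace df['text'].str.extract(regex).notnull() with df['text'].str.extractall(regex).notnull().groupby(level=0).max(). Here groupby(level=0) groups the matches by the original row index, and max() marks a label as True if any match in that row hit it.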
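
To see what the intermediate steps look like, here is a minimal sketch on a small, made-up Series (the sample strings and the names s, matches and hits are illustrative, not from the question). It also adds a reindex step, which is my own suggestion: extractall produces no rows for strings without any match, so those rows would otherwise be missing from the grouped result.

```python
import pandas as pd

# Illustrative data, not the question's DataFrame
s = pd.Series(["Taylor Swift at McDonald", "KFC", "nothing here"])
regex = "(?P<Food>McDonald|KFC)|(?P<Music>Taylor Swift)"

# One row per match; the index is a MultiIndex of (original row, match number)
matches = s.str.extractall(regex)

# Collapse back to one row per original string: True if any match hit the label
hits = matches.notnull().groupby(level=0).max()

# Assumption: restore rows that had no match at all, filling them with False
hits = hits.reindex(s.index, fill_value=False)

print(matches)
print(hits)
```

Without the reindex, assigning the result back to the DataFrame would leave NaN in rows that matched nothing.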

output:

                                                text      labels
0  Kevin Lee has bought a Taylor Swift's CD and e...  Food,Music
1       The best burger in McDonald is cheeze buger.        Food
2  Kevin Lee is planning to watch the Big Bang Th...     Food,TV
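
As an alternative, here is a rough sketch that avoids multiplying a boolean frame by the dictionary keys and instead checks one compiled pattern per label. This is not the approach from the answer above; the names patterns and find_labels are hypothetical, and re.escape is added on the assumption that keywords should be matched literally.

```python
import re
import pandas as pd

entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"],
}

df = pd.DataFrame({"text": [
    "Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
    "The best burger in McDonald is cheeze buger.",
    "Kevin Lee is planning to watch the Big Bang Theory and eat at KFC.",
]})

# One compiled pattern per label; re.escape treats each keyword as literal text
patterns = {
    label: re.compile("|".join(re.escape(k) for k in keywords))
    for label, keywords in entertainment_dict.items()
}

def find_labels(text):
    # Join every label whose pattern occurs in the text, in dictionary order
    return ",".join(label for label, pat in patterns.items() if pat.search(text))

df["labels"] = df["text"].apply(find_labels)
print(df)
```

Rows with no match get an empty string, and the labels appear in the order of the dictionary keys, as in the output above.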