Home > Enterprise >  Match both dicitonary key-values with pandas dataframe rows
Match both dicitonary key-values with pandas dataframe rows

Time:10-16

I can match each row with each diciotnary key but I am wondering if there's a way I can get the related value (string) in a different column as well.

import pandas as pd

entertainment_dict = {
  "Food": ["McDonald", "Five Guys", "KFC"],
  "Music": ["Taylor Swift", "Jay Z", "One Direction"],
  "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}

data = {'text':["Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.", 
                "The best burger in McDonald is cheeze buger.",
                "Kevin Lee is planning to watch the Big Bang Theory and eat at KFC."]}

df = pd.DataFrame(data)

regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extractall(regex).notnull().groupby(level=0).max()*entertainment_dict.keys())
                 .apply(lambda r: ','.join([i for i in r if i]) , axis=1)
                )

                                                text      labels
0  Kevin Lee has bought a Taylor Swift's CD and e...  Food,Music
1       The best burger in McDonald is cheeze buger.        Food
2  Kevin Lee is planning to watch the Big Bang Th...     Food,TV

Expected output

                                                text      labels words
0  Kevin Lee has bought a Taylor Swift's CD and e...  Food,Music Taylor Swift, McDonald
1       The best burger in McDonald is cheeze buger.        Food McDonald
2  Kevin Lee is planning to watch the Big Bang Th...     Food,TV Big Bang Theory, KFC

CodePudding user response:

You could use:

df['words'] = (df['text'].str.extractall(regex)
                         .groupby(level=0).first()
                         .apply(lambda x: ','.join(set(x).difference([None])),
                                axis=1)
              )

output:

                                    text      labels                  words
0     Kevin Lee has bought ... McDonald.  Food,Music  Taylor Swift,McDonald
1   The best burger in ... cheeze buger.        Food               McDonald
2  Kevin Lee is planning ... eat at KFC.     Food,TV    KFC,Big Bang Theory

CodePudding user response:

Use DataFrame.stack with convert first level to column by reset_index, so possible join values in GroupBy.agg, for unique values in order is used dict.fromkeys trick:

uniq = lambda x: ','.join(dict.fromkeys(x).keys())
df[['label','words']] = (df['text'].str.extractall(regex)
                                   .stack()
                                   .reset_index(level=-1)
                                   .groupby(level=0)
                                   .agg(uniq))

print (df)
                                                text       label  \
0  Kevin Lee has bought a Taylor Swift's CD and e...  Music,Food   
1       The best burger in McDonald is cheeze buger.        Food   
2  Kevin Lee is planning to watch the Big Bang Th...     TV,Food   

                   words  
0  Taylor Swift,McDonald  
1               McDonald  
2    Big Bang Theory,KFC  
  • Related