Find number of co-occurring elements between dataframe columns

Time:10-22

I have a DataFrame that has a website, categories, and keywords for that website.

Url  | categories                                | keywords
ESPN | [sport, nba, nfl]                         | [half, touchdown, referee, player, goal]
TMZ  | [entertainment, sport]                    | [gossip, celebrity, player]
Goal | [sport, premier_league, champions_league] | [football, goal, stadium, player, referee]

Which can be created using this code:

import pandas as pd

# One record per website; categories and keywords are stored as lists.
data = [
    {'Url': 'ESPN', 'categories': ['sport', 'nba', 'nfl'],
     'keywords': ['half', 'touchdown', 'referee', 'player', 'goal']},
    {'Url': 'TMZ', 'categories': ['entertainment', 'sport'],
     'keywords': ['gossip', 'celebrity', 'player']},
    {'Url': 'Goal', 'categories': ['sport', 'premier_league', 'champions_league'],
     'keywords': ['football', 'goal', 'stadium', 'player', 'referee']},
]

df = pd.DataFrame(data)

For each word in the keywords column, I want to count how often each category appears alongside it. The result might look like this:

{
  half: {sport: 1, nba: 1, nfl: 1},
  touchdown: {sport: 1, nba: 1, nfl: 1},
  referee: {sport: 2, nba: 1, nfl: 1, premier_league: 1, champions_league: 1},
  player: {sport: 3, nba: 1, nfl: 1, premier_league: 1, champions_league: 1},
  gossip: {sport: 1, entertainment: 1},
  celebrity: {sport: 1, entertainment: 1},
  goal: {sport: 2, premier_league: 1, champions_league: 1, nba: 1, nfl: 1},
  stadium: {sport: 1, premier_league: 1, champions_league: 1}
}

Answer:

Since both columns contain lists, you can explode each one so that every row is repeated once per list element, then group by the (keyword, category) pairs and count them:

result = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)
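The grouped result above is a Series indexed by (keyword, category) pairs, not the nested dictionary shown in the question. If the dictionary form is needed, one possible conversion is sketched below. The names `counts` and `nested` are illustrative and not part of the original answer, and the sample data is repeated so the snippet runs on its own.

```python
import pandas as pd

# Sample data repeated from the question so the sketch is self-contained.
data = [
    {"Url": "ESPN", "categories": ["sport", "nba", "nfl"],
     "keywords": ["half", "touchdown", "referee", "player", "goal"]},
    {"Url": "TMZ", "categories": ["entertainment", "sport"],
     "keywords": ["gossip", "celebrity", "player"]},
    {"Url": "Goal", "categories": ["sport", "premier_league", "champions_league"],
     "keywords": ["football", "goal", "stadium", "player", "referee"]},
]
df = pd.DataFrame(data)

# Same explode/groupby chain as in the answer: one count per (keyword, category) pair.
counts = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)

# Fold the MultiIndex Series into {keyword: {category: count}}.
nested = {}
for (keyword, category), n in counts.items():
    nested.setdefault(keyword, {})[category] = int(n)

print(nested["half"])
```

If a table is more convenient than a dictionary, `counts.unstack(fill_value=0)` should give a keyword-by-category grid with zeros where a pair never co-occurs.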