I have a DataFrame that has a website, categories, and keywords for that website.
Url | categories | keywords
Espn | [sport, nba, nfl] | [half, touchdown, referee, player, goal]
Tmz | [entertainment, sport] | [gossip, celebrity, player]
Goal | [sport, premier_league, champions_league] | [football, goal, stadium, player, referee]
It can be created with this code:
import pandas as pd

data = [
    {'Url': 'ESPN', 'categories': ['sport', 'nba', 'nfl'],
     'keywords': ['half', 'touchdown', 'referee', 'player', 'goal']},
    {'Url': 'TMZ', 'categories': ['entertainment', 'sport'],
     'keywords': ['gossip', 'celebrity', 'player']},
    {'Url': 'Goal', 'categories': ['sport', 'premier_league', 'champions_league'],
     'keywords': ['football', 'goal', 'stadium', 'player', 'referee']},
]
df = pd.DataFrame(data)
For every word in the keywords column, I want to count how often each category appears on the rows that contain that word. The result might look like this:
{half: {sport: 1, nba: 1, nfl: 1},
 touchdown: {sport: 1, nba: 1, nfl: 1},
 referee: {sport: 2, nba: 1, nfl: 1, premier_league: 1, champions_league: 1},
 player: {sport: 3, nba: 1, nfl: 1, entertainment: 1, premier_league: 1, champions_league: 1},
 gossip: {sport: 1, entertainment: 1},
 celebrity: {sport: 1, entertainment: 1},
 goal: {sport: 2, nba: 1, nfl: 1, premier_league: 1, champions_league: 1},
 football: {sport: 1, premier_league: 1, champions_league: 1},
 stadium: {sport: 1, premier_league: 1, champions_league: 1}}
CodePudding user response:
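Since both columns contain lists, you can explode them so that each row is repeated once per list element. Exploding both columns yields one row per (keyword, category) pair, which you can then group and count: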
result = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)
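The grouped result is a Series indexed by (keywords, categories) pairs, not the nested dictionary shown in the question. Here is a minimal sketch of one way to convert it, assuming the same sample data as in the question. The names `counts` and `nested` are only illustrative, and the plain loop is one option among several:

```python
import pandas as pd

# Sample data copied from the question.
data = [
    {"Url": "ESPN", "categories": ["sport", "nba", "nfl"],
     "keywords": ["half", "touchdown", "referee", "player", "goal"]},
    {"Url": "TMZ", "categories": ["entertainment", "sport"],
     "keywords": ["gossip", "celebrity", "player"]},
    {"Url": "Goal", "categories": ["sport", "premier_league", "champions_league"],
     "keywords": ["football", "goal", "stadium", "player", "referee"]},
]
df = pd.DataFrame(data)

# One row per (keyword, category) pair, then count the occurrences of each pair.
counts = (
    df.explode("keywords")
    .explode("categories")
    .groupby(["keywords", "categories"])
    .size()
)

# Fold the MultiIndex Series into {keyword: {category: count}}.
nested = {}
for (keyword, category), n in counts.items():
    nested.setdefault(keyword, {})[category] = int(n)

print(nested["player"])
```

The `int(n)` cast only turns NumPy integers into plain Python ints, which helps if the dictionary is later serialized, for example to JSON.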