Pandas Dataframe has a column that contains 'list' data-structure as entries. How to Group-CodePudding

Find top 3 producers who have produced movies with highest average ROI (Return on invesment)

Description: (I am being provided with a table)

import pandas as pd
import numpy as np
        
table = pd.DataFrame({'Movie_title':['Hot Tub Time Machine 2','The Princess Diaries 2: Royal Engagement','Whiplash','Kahaani','마린보이'],'Producers':[['Andrew Panay','Jason Blum'],['Whitney Houston', 'Mario Iscovich', 'Michel Litvak'],['David Lancaster', 'Michel Litvak', 'Jason Blum', 'Helen Estabrook'],['Sujoy Ghosh'],[]],'Directors':[['Steve Pink'],['Garry Marshall'],['Damien Chazelle'],['Sujoy Ghosh'],['Jong-seok Yoon']],'ROI':[-12.038207142857143,137.8735875,296.72727272727275,1233.3333333333333,-76.14607902735563]})

this is what the table dataframe looks like table DataFrame

I thought of applying '.groupby()' method on 'Producers' column and then using '.mean()' method on ROI column

table.groupby('Producers')[['Movie Title','ROI','Directors']].mean('ROI')

But its throwing an error (Please refer the following image)

last line of error

Please Refer the image below to see the error. I dont know how to add jupyter notebook code outputs and pandas dataframe Please help me out with this problem statement. I provided the images of the code blocks from jupyter notebooks.

CodePudding user response：

You can use the pandas explode function on the Producers column – for each element of the list in every row, you'll get a new row (with the data from the index and other columns duplicated).

table.explode("Producers")

                                Movie_title        Producers          Directors          ROI
0                    Hot Tub Time Machine 2     Andrew Panay       [Steve Pink]   -12.038207
0                    Hot Tub Time Machine 2       Jason Blum       [Steve Pink]   -12.038207
1  The Princess Diaries 2: Royal Engagement  Whitney Houston   [Garry Marshall]   137.873588
1  The Princess Diaries 2: Royal Engagement   Mario Iscovich   [Garry Marshall]   137.873588
1  The Princess Diaries 2: Royal Engagement    Michel Litvak   [Garry Marshall]   137.873588
2                                  Whiplash  David Lancaster  [Damien Chazelle]   296.727273
2                                  Whiplash    Michel Litvak  [Damien Chazelle]   296.727273
2                                  Whiplash       Jason Blum  [Damien Chazelle]   296.727273
2                                  Whiplash  Helen Estabrook  [Damien Chazelle]   296.727273
3                                   Kahaani      Sujoy Ghosh      [Sujoy Ghosh]  1233.333333
4                                      마린보이              NaN   [Jong-seok Yoon]   -76.146079

Then you can apply your mean groupby (and combine it with explode for a one line solution):

table.explode("Producers").groupby("Producers").mean("ROI")

Final result:

                         ROI
Producers                   
Andrew Panay      -12.038207
David Lancaster   296.727273
Helen Estabrook   296.727273
Jason Blum        142.344533
Mario Iscovich    137.873588
Michel Litvak     217.300430
Sujoy Ghosh      1233.333333
Whitney Houston   137.873588