How can I find the mean 'vote_average' for each actor?

In my movie data dataframe I have a column named 'cast', which contains a string of all the cast members for that given movie separated by a pipe character.

For example, the movie 'Jurassic World' has "Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson" in its cast column.

Some actors appear multiple times in the dataframe for separate movies.

I want to compare each separate cast member against another column called 'vote_average' and find each cast member's mean 'vote_average' across all the movies that they have been in.

I have tried df['cast'].str.cat(sep = '|').split('|') to get a list containing all actors, but I am not sure where to go from here.

CodePudding user response：

From what I could interpret from your question, you have a DataFrame that looks a bit like this:

import pandas as pd
df = pd.DataFrame({"film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
                   "cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson",
                            "Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
                   "vote_average": [5, 4]})

You can then split the string in cast on "|" to get a list of actors for each film:

df['cast'] = df['cast'].apply(lambda x: x.split('|'))

To find the average vote_average for each actor, you can then explode the column so each actor is in a separate row:

df = df.explode('cast')

Then finally, group the actors, and calculate the mean vote_average:

actors_mean_vote_avg = df.groupby('cast')['vote_average'].mean()
actors_mean_vote_avg
#Out: 
#cast
#Bryce Dallas Howard    4.5
#Chris Pratt            4.5
#Irrfan Khan            5.0
#Nick Robinson          5.0
#Rafe Spall             4.0
#Vincent D'Onofrio      5.0
#Name: vote_average, dtype: float64
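
The three steps above can also be chained into a single expression. The following is only a minimal sketch, assuming the same small example frame as above. The result column names mean_vote and n_movies are made up for illustration, and the movie count is added just to show how many films each mean is based on.

```python
import pandas as pd

# Same illustrative frame as in the answer above.
df = pd.DataFrame({
    "film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
    "cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson",
             "Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
    "vote_average": [5, 4],
})

per_actor = (
    df.assign(cast=df["cast"].str.split("|"))   # string -> list of actors, original df untouched
      .explode("cast")                          # one row per (film, actor) pair
      .groupby("cast")["vote_average"]
      .agg(mean_vote="mean", n_movies="count")  # hypothetical output column names
)
print(per_actor)
```

Because assign returns a new frame, the original df keeps its pipe-separated cast strings, which can be handy if you need them again later.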

If this is not what you are after, please provide an example of your DataFrame and an example of the desired output.

CodePudding user response：

Since I didn't have your DataFrame, I invented one based on what I understood from your question:

List generator (just to exemplify your df):

import numpy as np
import pandas as pd

# Build placeholder film titles such as 'Film 0', 'Film 1', ...
x = int(input('Insert length (int):    '))  # enter 10 so that there are 11 titles in total
y = input('Insert string:     ')
new_list = [y + ' ' + str(i) for i in range(x)]
new_list.append('Jurassic World')  # added your film
actors = ['Vin Diesel|Shahrukh Khan|Salman Khan|Irrfan Khan',
  'Vin Gasoline|Harrison Tesla|Salmon Rosa|Matt Angel|Demi Less',
  'Not von Diesel|Ryan Davidson',
  'Chris Bratt|Bread Butter|Bruce Wayno|Robinson Crusoe',
  'Groot|Watzlav|David Bronzefield|Vin Diesel',
  'Jessica Fox|Jamie Rabbit|Harrison Tesla|Salmon Rosa',
  'Bryce Dallas Howard|David Bronzefield|Robinson Crusoe',
  'Asterix|Garfield|Chris Pratt|Smurfix',
  'Almost vin Diesel|Vin Gasoline|Dwayne Paper',
  'Vin Gasoline|Jessica Fox|Demi Less',
  'Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D`Onofrio|Nick Robinson'] # 11 rows
# Random ratings, one per film (the three lists must all have 11 entries)
votes_average = np.random.uniform(low=6, high=9.8, size=(11,))

Here is my df for the answer:

df = pd.DataFrame({'film': new_list, 'actors': actors, 'imdb': votes_average})
# First split the cast column into separate columns named `cast_x`
part = df['actors'].str.split('|', expand=True).rename(columns=lambda x: 'cast_' + str(x))
# Now join them to the main df, creating df_new
df_new = pd.concat([df, part], axis=1)

Now comes the complicated part, but you can try it yourself, running the chain one method at a time to see what happens to the df:

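# Stack the cast_* columns into one long column, join the film data back on the row index,
# then aggregate per actor. Recent pandas versions may warn about NumPy callables in .agg;
# the strings 'mean' and 'size' should work as alternatives.
group = (df_new.filter(like='cast')
               .stack()
               .reset_index(level=1, drop=True)
               .to_frame('casts')
               .join(df)
               .groupby('casts')
               .agg({'imdb': [np.mean, np.size],
                     'film': lambda x: list(pd.unique(x))}))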

I found it reasonable to use .agg to get more statistics (you can add np.min and/or np.max after the comma as well).

I wanted to see how many movies each average is based on (np.size) and which movies each actor appeared in (the lambda with pd.unique):

group.loc['Vin Gasoline']
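
To see what each method in the chain does, it may help to run the steps one at a time on a tiny frame with fixed ratings. The following is only a sketch under that assumption: the three films, their ratings and the intermediate variable names (wide, long_form, joined, summary) are invented for illustration and are not part of the data above.

```python
import pandas as pd

# Tiny made-up frame with fixed ratings so the result is reproducible.
df = pd.DataFrame({
    "film": ["Film 0", "Film 1", "Film 2"],
    "actors": ["Vin Gasoline|Demi Less", "Vin Gasoline|Jessica Fox", "Jessica Fox"],
    "imdb": [8.0, 6.0, 7.0],
})

# Step 1: one column per cast position (cast_0, cast_1, ...);
# films with fewer actors are padded with missing values.
wide = df["actors"].str.split("|", expand=True).rename(columns=lambda x: "cast_" + str(x))

# Step 2: stack the cast_* columns into one long column and drop the
# inner index level, so the index is the original film row again.
long_form = wide.stack().reset_index(level=1, drop=True).to_frame("casts")
long_form = long_form.dropna()  # discard padding if this pandas version kept it

# Step 3: join the film data back on the row index (one row per film/actor pair).
joined = long_form.join(df)

# Step 4: aggregate the rating per actor.
summary = joined.groupby("casts")["imdb"].agg(["mean", "size"])
print(summary)
```

Printing wide, long_form and joined in turn shows how the frame goes from one row per film to one row per film/actor pair before the groupby.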