In my movie data dataframe I have a column named 'cast', which contains a string of all the cast members for that given movie separated by a pipe character.
For example, the movie 'Jurassic World' has "Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson" in its cast column.
Some actors appear multiple times in the dataframe for separate movies.
I want to match each individual cast member against another column called 'vote_average' and find each cast member's mean 'vote_average' across all the movies that they have been in.
I have tried df['cast'].str.cat(sep = '|').split('|') to get a list containing all actors, but I am not sure where to go from here.
CodePudding user response:
From what I could interpret from your question, you have a DataFrame that looks a bit like this:
import pandas as pd
df = pd.DataFrame({"film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
"cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson",
"Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
"vote_average": [5, 4]})
You can then split each cast string on "|" to turn it into a list of actors:
df['cast'] = df['cast'].apply(lambda x: x.split('|'))
To find the average vote_average for each actor, you can then explode the column so each actor is in a separate row:
df = df.explode('cast')
Then finally, group by actor and calculate the mean vote_average:
actors_mean_vote_avg = df.groupby('cast')['vote_average'].mean()
actors_mean_vote_avg
#Out:
#cast
#Bryce Dallas Howard 4.5
#Chris Pratt 4.5
#Irrfan Khan 5.0
#Nick Robinson 5.0
#Rafe Spall 4.0
#Vincent D'Onofrio 5.0
#Name: vote_average, dtype: float64
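The three steps above can also be chained into a single expression. This is only a minimal sketch on the same example DataFrame; the result column names mean_vote and n_films are made up for illustration and are not from the question. It also counts how many films each mean is based on, which helps when judging actors who only appear once.

```python
import pandas as pd

df = pd.DataFrame({
    "film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
    "cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson",
             "Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
    "vote_average": [5, 4],
})

# Split each cast string into a list, give every actor their own row,
# then aggregate vote_average per actor.
# mean_vote and n_films are illustrative names, not from the question.
per_actor = (
    df.assign(cast=df["cast"].str.split("|"))
      .explode("cast")
      .groupby("cast")["vote_average"]
      .agg(mean_vote="mean", n_films="count")
)
print(per_actor)
```

Using assign leaves the original df untouched, so the pipe-separated cast strings are still available afterwards.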
If this is not correct, please can you provide an example of your DataFrame, and an example of the desired output.
CodePudding user response:
Since I didn't have your DF, I invented one based on what I understood from your question:
List generator (just to exemplify your df):
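import numpy as np
import pandas as pd
x=int(input('Insert length (int): ')) # enter 10: with the extra film below this gives the 11 rows used later
y=str(input('Insert string: '))
lst=list([y]*x)
new_list=[]
for i in range(x):
    new_list.append(lst[i] + str(' ') + str(i)) # e.g. 'Film 0', 'Film 1', ...
new_list.append('Jurassic World ') # added your film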
actors=['Vin Diesel|Shahrukh Khan|Salman Khan|Irrfan Khan',
'Vin Gasoline|Harrison Tesla|Salmon Rosa|Matt Angel|Demi Less',
'Not von Diesel|Ryan Davidson',
'Chris Bratt|Bread Butter|Bruce Wayno|Robinson Crusoe',
'Groot|Watzlav|David Bronzefield|Vin Diesel',
'Jessica Fox|Jamie Rabbit|Harrison Tesla|Salmon Rosa',
'Bryce Dallas Howard|David Bronzefield|Robinson Crusoe',
'Asterix|Garfield|Chris Pratt|Smurfix',
'Almost vin Diesel|Vin Gasoline|Dwayne Paper',
'Vin Gasoline|Jessica Fox|Demi Less',
'Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D`Onofrio|Nick Robinson'] # 11 rows
votes_average = np.random.uniform(low=6, high=9.8, size=(11,))
Here is my df for the answer:
df=pd.DataFrame({'film' : new_list, 'actors': actors, 'imdb' : votes_average})
# First split the cast string into separate columns, named `cast_x`
part=df['actors'].str.split('|',expand=True).rename(columns= lambda x : 'cast_' + str(x))
# Now join them to the main df, creating df_new
df_new=pd.concat([df,part],axis=1)
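To see what the expand=True split produces before it is joined back, here is a small sketch on a made-up two-row frame (the toy and wide names and the sample films are assumptions for illustration only). Rows with fewer actors are padded with missing values, which is why the next step has to reshape the columns back into rows.

```python
import pandas as pd

# Hypothetical two-row frame, just to show the shape of the split result.
toy = pd.DataFrame({
    "film": ["A", "B"],
    "actors": ["Vin Diesel|Irrfan Khan|Groot", "Groot|Asterix"],
})

# One column per cast position; shorter casts are padded with missing values.
wide = (toy["actors"].str.split("|", expand=True)
                     .rename(columns=lambda i: "cast_" + str(i)))

print(wide.columns.tolist())
print(pd.isna(wide.loc[1, "cast_2"]))  # film B has only two actors
```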
Now comes the complicated part, but you can try it for yourself one method at a time and see what happens to the df at each step:
group = (df_new.filter(like='cast').stack()
.reset_index(level=1, drop=True)
.to_frame('casts')
.join(df)
.groupby('casts')
.agg({'imdb':(np.mean,np.size),'film': lambda x: list(pd.unique(x))}))
I found it reasonable to use .agg to get more statistics (you can add np.min and/or np.max after the comma in the tuple as well).
I wanted to see how many movies each average is based on (np.size) and which movies each actor appeared in (the lambda with pd.unique):
group.loc['Vin Gasoline']
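The same kind of summary can be sketched with explode and named aggregation, which gives flat, readable column names. This is an alternative under assumptions, not the code above: the three-row frame and the output names imdb_mean, n_films and films are made up for illustration, and "count" (non-missing values) stands in for np.size.

```python
import pandas as pd

# Small made-up frame; column names follow this answer ('film', 'actors', 'imdb').
df = pd.DataFrame({
    "film": ["Film 0", "Film 1", "Film 2"],
    "actors": ["Vin Gasoline|Demi Less", "Vin Gasoline|Jessica Fox", "Jessica Fox"],
    "imdb": [6.0, 8.0, 9.0],
})

# One row per (film, actor) pair.
long_df = df.assign(casts=df["actors"].str.split("|")).explode("casts")

# Named aggregation: output_name=(input_column, function).
group = long_df.groupby("casts").agg(
    imdb_mean=("imdb", "mean"),
    n_films=("imdb", "count"),
    films=("film", lambda s: list(s.unique())),
)
print(group.loc["Vin Gasoline"])
```

With string function names such as "mean" there is no need to import numpy for the aggregation itself.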