I am using a large IMDB dataset to do some practicing.
df_movies[(df_movies['movie'] == 'Fight Club (1999)') | (df_movies['movie'] == 'Se7en (1995)')].value_counts('actor')
The above query has given me this dataset.
movie actor year duration adult_movie rating votes
3506 Se7en (1995) Brad Pitt (b.1963) 1995 127.0 0 8.6 1342867.0
3513 Fight Club (1999) Brad Pitt (b.1963) 1999 139.0 0 8.8 1742920.0
5380 Se7en (1995) Morgan Freeman (b.1937) 1995 127.0 0 8.6 1342867.0
8065 Se7en (1995) Kevin Spacey (b.1959) 1995 127.0 0 8.6 1342867.0
40124 Fight Club (1999) Meat Loaf (b.1947) 1999 139.0 0 8.8 1742920.0
40905 Fight Club (1999) Edward Norton (b.1969) 1999 139.0 0 8.8 1742920.0
46730 Se7en (1995) Andrew Kevin Walker (b.1964) 1995 127.0 0 8.6 1342867.0
390010 Fight Club (1999) Zach Grenier (b.1954) 1999 139.0 0 8.8 1742920.0
Brad Pitt is the only actor on this list to appear in both movies. How can I filter it such that Brad Pitt is returned?
CodePudding user response:
Using pd.value_counts()
, you can obtain the number of occurrences of each actor in the list. The result is sorted from highest to lowest.
df.drop_duplicates(['movie', 'actor'])['actor'].value_counts().index[0]
CodePudding user response:
Replace value_counts
by groupby_filter
:
movies = ['Fight Club (1999)', 'Se7en (1995)']
both_movies = lambda x: len(x) == len(movies)
out = df_movies[df_movies['movie'].isin(movies)].groupby('actor').filter(both_movies)
Output:
>>> out
movie actor year duration adult_movie rating votes
3506 Se7en (1995) Brad Pitt (b.1963) 1995 127.0 0 8.6 1342867.0
3513 Fight Club (1999) Brad Pitt (b.1963) 1999 139.0 0 8.8 1742920.0