I have a movie data frame from which I want to extract the top 5 genres for the United States. I thought about using groupby, but it doesn't work because the genre column (listed_in) is treated as a single string, even when it lists several genres. How could this be done?
Here is what I tried:
netflix_df.groupby(['country']['listed_in']).count().sort_values(ascending = False).head(5)
Data frame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int32
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
dtypes: int32(1), object(10)
memory usage: 722.6 KB
Snippet of the dataframe
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV
CodePudding user response:
You could filter the rows with .loc and then call .value_counts on the genre column. Note that the country value in your data is 'United States', not 'US'.
For example:
netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5)
If you want to extract the counts or the index labels as a list:
# counts
netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5).tolist()
# index labels (the listed_in strings)
netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5).index.tolist()
Edit:
Rereading the question: if the genre field can hold more than one genre per row, you have to split the string, iterate over the parts, and keep a count for each genre. Here is a small function that may help:
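def getGenres(series):
    # Count how often each individual genre appears across all rows.
    genres = {}
    for row in series:
        if isinstance(row, str):  # skip NaN / missing values
            for genre in row.split(','):
                # strip the space after each comma; .title() normalizes casing
                # (note that it also turns 'TV' into 'Tv')
                name = genre.strip().title()
                genres[name] = genres.get(name, 0) + 1
    return genres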
Then call it on the filtered column:
getGenres(netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'])
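To get the top 5, sort the resulting dict by value in descending order and keep the first five items, for example with sorted(genres.items(), key=lambda kv: kv[1], reverse=True)[:5].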
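The same split-and-count idea can also be done inside pandas. The following is only a sketch under some assumptions: the toy netflix_df is made up for illustration, top_genres is a hypothetical helper name, and each country cell is assumed to hold a single country.

```python
import pandas as pd

# Toy stand-in for netflix_df; these rows are invented for illustration only.
netflix_df = pd.DataFrame({
    "country": ["United States", "United States", "India", "United States", None],
    "listed_in": [
        "Documentaries",
        "Comedies, Dramas",
        "International TV Shows, TV Dramas",
        "Dramas, Documentaries",
        "Docuseries, Reality TV",
    ],
})


def top_genres(df, country, n=5):
    """Return the n most frequent individual genres for one country.

    Hypothetical helper; assumes `country` matches the cell value exactly.
    """
    genres = (
        df.loc[df["country"] == country, "listed_in"]
        .dropna()
        .str.split(",")   # "A, B" -> ["A", " B"]
        .explode()        # one row per genre
        .str.strip()      # remove the space left after each comma
    )
    return genres.value_counts().head(n)


top_us = top_genres(netflix_df, "United States")
print(top_us)
```

If a country cell can list several countries, the exact comparison above would miss those rows; matching with .str.contains on the country column might be a better fit in that case.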