How can I get the top 5 movie genres for the United States from my movie data frame using pandas?

I have a movie data frame from which I want to extract the top 5 movie genres for the United States. I thought about using groupby, but it doesn't work because my genre column (listed_in) is treated as a single string. How could this be done?

Here is what I tried:

netflix_df.groupby(['country']['listed_in']).count().sort_values(ascending = False).head(5)

Data frame information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int32 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
dtypes: int32(1), object(10)
memory usage: 722.6  KB

Snippet of the dataframe

index | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act...
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV

CodePudding user response：

You could filter the rows with .loc and then call .value_counts() on the genre column.

For example:

netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5)

If you want to extract the values or the indexes in a list:

# counts, as a list
netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5).tolist()
# genre names (the index of the result), as a list
netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'].value_counts().head(5).index.tolist()
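To see what the filter-then-count step does with a listed_in column like the one in the question, here is a small sketch on a made-up frame. The toy_df data is hypothetical and only mimics the column names and the 'United States' spelling shown in the snippet. It suggests that value_counts() counts each whole listed_in string as one value, so a combined entry such as "Comedies, Dramas" is tallied separately from "Comedies".

```python
import pandas as pd

# Hypothetical toy data that only mimics the question's column names.
toy_df = pd.DataFrame({
    "country": ["United States", "United States", "India", "United States"],
    "listed_in": ["Comedies", "Comedies, Dramas", "Dramas", "Comedies"],
})

# Select the genre column for US rows in a single .loc call.
us_genres = toy_df.loc[toy_df["country"] == "United States", "listed_in"]

# Each distinct string is counted as-is; nothing is split on commas.
counts = us_genres.value_counts()
print(counts)
```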

#Edit

Going back to the question: if the genre field can hold more than one genre, you have to split the string, iterate over the pieces, and keep a running count for each genre. Here is a small function that may help.

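def getGenres(series):
    # Map each genre name to the number of times it appears.
    genres = {}

    for row in series:
        # Skip missing values (NaN is a float, not a str).
        if isinstance(row, str):
            for genre in row.split(','):
                # Normalize whitespace and capitalization before counting.
                # Note that .title() turns 'TV' into 'Tv'.
                key = genre.strip().title()
                genres[key] = genres.get(key, 0) + 1

    return genres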

Then just call it on the filtered column:

getGenres(netflix_df.loc[netflix_df['country'] == 'United States', 'listed_in'])

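To get the top 5, sort the returned dict by value, for example with sorted(genres.items(), key=lambda item: item[1], reverse=True)[:5], where genres is the dict returned by getGenres.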
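As an alternative to looping by hand, the same split-and-count idea can probably be expressed with pandas string methods and Series.explode. This is a minimal sketch, not a tested solution for the real file: top_genres and toy_df are hypothetical names, the data is made up, and it assumes each country cell holds exactly one country name, as in the snippet above. If a cell could list several countries, the equality filter would need to change.

```python
import pandas as pd

# Hypothetical toy data that only mimics the question's column names.
toy_df = pd.DataFrame({
    "country": ["United States", "United States", "India", "United States", None],
    "listed_in": [
        "Comedies",
        "Comedies, Dramas",
        "Dramas",
        "Dramas, Documentaries",
        "Dramas",
    ],
})


def top_genres(df, country, n=5):
    """Return the n most frequent individual genres for one country."""
    # Rows with a missing country compare as False and are dropped.
    mask = df["country"] == country
    return (
        df.loc[mask, "listed_in"]
        .str.split(",")   # "Comedies, Dramas" -> ["Comedies", " Dramas"]
        .explode()        # one row per genre
        .str.strip()      # remove the space left after each comma
        .value_counts()
        .head(n)
    )


top = top_genres(toy_df, "United States")
print(top)
```

Unlike getGenres, this sketch does not call .title(), so genre names keep their original capitalization.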