Using groupby() on an exploded pandas DataFrame returns a data frame where indeces are repeated but-CodePudding

I am working with a dataset found in kaggle (

CodePudding user response：

try cleaning the column country before groupby

df_split_countries['country'] = df_split_countries['country'].str.strip()
df_split_countries['country'] = df_split_countries['country'].map(lambda x: .encode(x, errors="ignore").decode())

I think countries are repeated because they are not exactly the same string values, maybe one contain extra space...

CodePudding user response：

Have you considered using a flag for coproductions and then explode?

import pandas as pd
df = pd.read_csv('netflix_titles.csv')
# drop null countries
df = df[df["country"].notnull()].reset_index(drop=True)

# Flag for coproductions
df["countries_coproduction"] = df['country']\
    .str.split(',').apply(len).gt(1)

# Explode
df = df.assign(
    country=df['country'].str.split(','))\
   .explode('country')

Then you can easily estract the top 10 countries in each case as

grp[grp["countries_coproduction"]].nlargest(10, "n")\
    .reset_index(drop=True)

   countries_coproduction          country    n
0                    True    United States  479
1                    True    United States  393
2                    True   United Kingdom  209
3                    True           France  181
4                    True   United Kingdom  178
5                    True           Canada  174
6                    True          Germany  123
7                    True           Canada   90
8                    True           France   88
9                    True          Belgium   72

and

grp[~grp["countries_coproduction"]].nlargest(10, "n")\
    .reset_index(drop=True)

0                   False   United States  2818
1                   False           India   972
2                   False  United Kingdom   419
3                   False           Japan   245
4                   False     South Korea   199
5                   False          Canada   181
6                   False           Spain   145
7                   False          France   124
8                   False          Mexico   110
9                   False           Egypt   106