I am working with a dataset found in kaggle (
CodePudding user response:
try cleaning the column country before groupby
df_split_countries['country'] = df_split_countries['country'].str.strip()
df_split_countries['country'] = df_split_countries['country'].map(lambda x: .encode(x, errors="ignore").decode())
I think countries are repeated because they are not exactly the same string values, maybe one contain extra space...
CodePudding user response:
Have you considered using a flag for coproductions and then explode?
import pandas as pd
df = pd.read_csv('netflix_titles.csv')
# drop null countries
df = df[df["country"].notnull()].reset_index(drop=True)
# Flag for coproductions
df["countries_coproduction"] = df['country']\
.str.split(',').apply(len).gt(1)
# Explode
df = df.assign(
country=df['country'].str.split(','))\
.explode('country')
Then you can easily estract the top 10 countries in each case as
grp[grp["countries_coproduction"]].nlargest(10, "n")\
.reset_index(drop=True)
countries_coproduction country n
0 True United States 479
1 True United States 393
2 True United Kingdom 209
3 True France 181
4 True United Kingdom 178
5 True Canada 174
6 True Germany 123
7 True Canada 90
8 True France 88
9 True Belgium 72
and
grp[~grp["countries_coproduction"]].nlargest(10, "n")\
.reset_index(drop=True)
0 False United States 2818
1 False India 972
2 False United Kingdom 419
3 False Japan 245
4 False South Korea 199
5 False Canada 181
6 False Spain 145
7 False France 124
8 False Mexico 110
9 False Egypt 106