I am working on the following dataset: https://drive.google.com/file/d/1UVgSfIO-46aLKHeyk2LuKV6nVyFjBdWX/view?usp=sharing
I am trying to replace the countries in the "Nationality" column whose value_counts() are less than 450 with the value of "Others".
def collapse_category(df):
df.loc[df['Nationality'].map(df['Nationality'].value_counts(normalize=True)
.lt(450)), 'Nationality'] = 'Others'
print(df['Nationality'].unique())
This is the code I used but it returns the result as this: ['Others'] Here is the link to my notebook for reference: https://colab.research.google.com/drive/1MfwwBfi9_4E1BaZcPnS7KJjTy8xVsgZO?usp=sharing
CodePudding user response:
Use boolean indexing:
s = df['Nationality'].value_counts()
df.loc[df['Nationality'].isin(s[s<450].index), 'Nationality'] = 'Others'
New value_counts
after the change:
FRA 12307
PRT 11382
DEU 10164
GBR 8610
Others 5354
ESP 4864
USA 3398
... ...
FIN 632
RUS 578
ROU 475
Name: Nationality, dtype: int64
CodePudding user response:
value_filter = df.Nationality.value_counts().lt(450)
temp_dict = value_filter[value_filter == False].replace({False: "others"}).to_dict()
df = df.replace(temp_dict)
In general, the third line will look up the entire df rather than a particular column. But the above code will work for you.