value_counts() returns removed/filtered-out data in "Categorical" datatype in Pandas

Time:01-03

Could someone please clarify the following for me:

import pandas as pd

df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')

print(df.dtypes)
years    category
dtype: object

Now I create a list of the years I want to keep:

subset_years = [2015, 2016, 2017, 2018]

Then I filter the years column with it:

subset_df = df[df['years'].isin(subset_years)]
print(subset_df)

   years
0   2015
1   2016
2   2017
3   2017
4   2018

Now I take the unique values:

subset_df.years.unique()

and I get:

[2015, 2016, 2017, 2018]
Categories (4, int64): [2015, 2016, 2017, 2018]

But if I call subset_df.years.value_counts(), I get:

2015    1
2016    1
2017    2
2018    1
2019    0
2020    0
Name: years, dtype: int64

My question: why does subset_df.years.value_counts() still return 2019 and 2020, each with a count of 0? I have already filtered the years, so wasn't the filter supposed to remove them?

Could someone please clarify what is happening?

CodePudding user response:

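That's because 2019 and 2020 are still among the column's categories. Filtering rows does not change the categorical dtype, and value_counts() reports every declared category, including those with a count of 0. If you don't want the filtered-out years to show up, reset the categories before calling value_counts():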

subset_df.years.cat.set_categories(subset_years).value_counts()
#2017    2
#2015    1
#2016    1
#2018    1
#Name: years, dtype: int64
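
As a possible alternative to passing the list explicitly, pandas also offers cat.remove_unused_categories(), which drops every category that no longer occurs in the data. The sketch below rebuilds the example from the question and assumes that any category with no remaining rows should be dropped; the variable names trimmed and counts are only illustrative.

```python
import pandas as pd

# Rebuild the data from the question.
df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')

subset_years = [2015, 2016, 2017, 2018]
subset_df = df[df['years'].isin(subset_years)]

# Filtering rows leaves the dtype untouched: all six categories are still declared.
print(subset_df['years'].cat.categories.tolist())

# Drop the categories that no longer appear in the filtered data.
# This returns a new Series; subset_df itself is not modified.
trimmed = subset_df['years'].cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())

# value_counts() now only reports the years that survived the filter.
counts = trimmed.value_counts()
print(counts)
```

To keep the change, assign the result back to a column (for example on a copy of subset_df) rather than calling it only inside the value_counts() chain.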