Could someone please clarify this for me?
import pandas as pd
df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')
print(df.dtypes)
years category
dtype: object
Now I create a list of the values I want to keep from the years column:
subset_years = [2015, 2016, 2017, 2018]
Then I filter on the years column:
subset_df = df[df['years'].isin(subset_years)]
print(subset_df)
years
0 2015
1 2016
2 2017
3 2017
4 2018
Now I take the unique elements:
subset_df.years.unique()
and I get:
[2015, 2016, 2017, 2018]
Categories (4, int64): [2015, 2016, 2017, 2018]
But if I call subset_df.years.value_counts(), I get:
2015 1
2016 1
2017 2
2018 1
2019 0
2020 0
Name: years, dtype: int64
My question is: why does subset_df.years.value_counts() return 2019 and 2020 with a count of 0? Since I already filtered the years column, weren't those years supposed to be removed by the filter?
Could someone please clarify what is happening?
Answer:
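That's because 2019 and 2020 are still among the column's categories. Filtering the rows of a categorical column does not change its set of categories, and value_counts() reports every category, including those with a count of 0. You can reset the categories before calling value_counts() if you don't want the filtered-out years to show up: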
subset_df.years.cat.set_categories(subset_years).value_counts()
#2017 2
#2015 1
#2016 1
#2018 1
#Name: years, dtype: int64
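If you would rather not pass the list of years again, another option is to drop whatever categories no longer occur in the data. The following is a minimal sketch using the same df and subset_years as in the question; the variable names all_categories, trimmed, remaining_categories and counts are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'years': [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')

subset_years = [2015, 2016, 2017, 2018]
subset_df = df[df['years'].isin(subset_years)]

# Filtering rows leaves the category list untouched,
# so 2019 and 2020 are still listed here.
all_categories = list(subset_df['years'].cat.categories)

# Drop the categories that no longer appear in the filtered data.
trimmed = subset_df['years'].cat.remove_unused_categories()
remaining_categories = list(trimmed.cat.categories)

# value_counts() now only reports the years that survived the filter.
counts = trimmed.value_counts()

print(all_categories)
print(remaining_categories)
print(counts)
```

Here cat.remove_unused_categories() returns a new Series, so assign the result back to the column if you want the change to persist. Whether this or set_categories(subset_years) fits better depends on whether you want a year from subset_years that has no rows to still show up with a count of 0.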