Create subsets based on output of the value_counts function

Time:12-06

My dataset looks as follows:

Area Year Value Crop
Afghanistan 1961 1000 Corn
Afghanistan 1961 1100 Potato
Afghanistan 1961 1200 Wheat
Afghanistan 1962 1304 Potato
Afghanistan 1962 1923 Corn
Bhutan 1961 1236 Corn
Bhutan 1961 1836 Potato
Bhutan 1961 1031 Wheat
Bhutan 1962 1111 Corn

I have a total of 60 years of data under consideration, with each country able to contain at most one row per crop per year, for a maximum of 240 rows per country. I used the .value_counts function to find which countries are incomplete, and I would like to create a subset containing only the countries that have a complete set of data (i.e. where .value_counts returned 240 for that country).

I have tried doing this as follows: df_240 = df.loc[df["Area"].value_counts == 240], but I am having no luck.

The error I am receiving is:

KeyError: 'False: boolean label can not be used without a boolean index'

CodePudding user response:

This is what you are looking for:

df1 = df.groupby("Area").filter(lambda x: len(x) == 240)
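
To see the filter idea on something small, here is a minimal sketch using the nine sample rows from the question. It groups by "Area" alone so that each group is one whole country. The threshold of 5 is an assumption standing in for 240, and the names `sample`, `EXPECTED_ROWS` and `complete` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
sample = pd.DataFrame({
    "Area": ["Afghanistan"] * 5 + ["Bhutan"] * 4,
    "Year": [1961, 1961, 1961, 1962, 1962, 1961, 1961, 1961, 1962],
    "Value": [1000, 1100, 1200, 1304, 1923, 1236, 1836, 1031, 1111],
    "Crop": ["Corn", "Potato", "Wheat", "Potato", "Corn",
             "Corn", "Potato", "Wheat", "Corn"],
})

# Assumption: 5 stands in for the real threshold of 240,
# because the sample is far smaller than the full dataset.
EXPECTED_ROWS = 5

# Group by country only, then keep every row of the groups
# whose row count equals the expected total.
complete = sample.groupby("Area").filter(lambda g: len(g) == EXPECTED_ROWS)

print(complete["Area"].unique())
```

With this sample only the Afghanistan rows should remain, since Bhutan has four rows rather than five.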

CodePudding user response:

This groups by 'Area', selects the 'Year' column, and passes 'count' to transform, which returns each country's row count aligned to every original row. Comparing that count to 240 gives a boolean mask, which is then used to select rows. Finally the index is reset, with drop=True so the old index is discarded.

import pandas as pd

ind = df.groupby('Area')['Year'].transform('count') >= 240
df1 = df[ind].reset_index(drop=True)

print(df1)
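
The attempt in the question can also be made to work with value_counts directly. The KeyError appears because value_counts was written without parentheses, so the comparison with 240 produces a single False instead of a Series, and .loc then tries to look up False as a label. A minimal sketch, assuming a small made-up frame and a threshold of 3 in place of 240 (`toy`, `THRESHOLD`, `counts`, `full_areas` and `subset` are illustrative names, not part of the original data):

```python
import pandas as pd

# Made-up data: area "A" has 3 rows, area "B" has 2.
toy = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B"],
    "Year": [1961, 1961, 1962, 1961, 1962],
})

THRESHOLD = 3  # assumption: stands in for 240

# Call value_counts() (with parentheses) to get rows per area.
counts = toy["Area"].value_counts()

# Labels of the areas whose count equals the threshold.
full_areas = counts[counts == THRESHOLD].index

# Keep only the rows that belong to those areas.
subset = toy[toy["Area"].isin(full_areas)].reset_index(drop=True)

print(subset)
```

The isin step is needed because value_counts returns one entry per area, while the mask passed to the DataFrame must have one entry per row.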