My dataset looks as follows:
Area | Year | Value | Crop |
---|---|---|---|
Afghanistan | 1961 | 1000 | Corn |
Afghanistan | 1961 | 1100 | Potato |
Afghanistan | 1961 | 1200 | Wheat |
Afghanistan | 1962 | 1304 | Potato |
Afghanistan | 1962 | 1923 | Corn |
Bhutan | 1961 | 1236 | Corn |
Bhutan | 1961 | 1836 | Potato |
Bhutan | 1961 | 1031 | Wheat |
Bhutan | 1962 | 1111 | Corn |
I have 60 years of data under consideration, and each country can have at most one row per crop per year, for a maximum of 240 rows per country.
I have used the .value_counts function to check which countries are incomplete, and I would like to create a subset containing only the countries that have a complete set of data (i.e. where .value_counts returned 240 for that country).
I have tried doing this as follows:
df_240 = df.loc[df["Area"].value_counts == 240]
However, I am having no luck.
The error I am receiving is:
KeyError: 'False: boolean label can not be used without a boolean index'
CodePudding user response:
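Group by "Area" only and keep the groups that have exactly 240 rows. Grouping by both "Area" and "Year" would give one small group per country per year, so no group would ever reach 240 rows:
df1 = df.groupby("Area").filter(lambda x: len(x) == 240)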
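To see the per-country filter in action, here is a minimal sketch on toy data. It assumes only two years and two crops, so a complete country has 4 rows instead of 240; the name EXPECTED_ROWS and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical threshold for this toy example; use 240 for the real dataset.
EXPECTED_ROWS = 4

# Afghanistan has all 4 rows, Bhutan is missing one (Potato, 1962).
df = pd.DataFrame({
    "Area": ["Afghanistan"] * 4 + ["Bhutan"] * 3,
    "Year": [1961, 1961, 1962, 1962, 1961, 1961, 1962],
    "Value": [1000, 1100, 1304, 1923, 1236, 1836, 1111],
    "Crop": ["Corn", "Potato", "Potato", "Corn", "Corn", "Potato", "Corn"],
})

# filter() keeps or drops whole groups: a country survives only if the
# callable returns True for its group.
complete = df.groupby("Area").filter(lambda g: len(g) == EXPECTED_ROWS)

print(complete["Area"].unique())
```

Only the Afghanistan rows should remain, since Bhutan has 3 rows rather than 4.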
CodePudding user response:
Group by 'Area', select the 'Year' column, and pass 'count' to transform, so that every row carries the number of rows for its country. Comparing that count with 240 gives a boolean mask, which is then used to select rows. Finally, the index is reset with drop=True so that the old index is discarded rather than kept as a column.
import pandas as pd
ind = df.groupby('Area')['Year'].transform('count') >= 240
df1 = df[ind].reset_index(drop=True)
print(df1)
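As for why the original attempt fails: value_counts was never called (there are no parentheses), so the comparison is between a method object and 240, which evaluates to the single value False. Passing that to df.loc produces the KeyError shown in the question. Even with the parentheses, the result is indexed by country rather than by row, so it still cannot be used directly as a row mask. A rough sketch of one way to stay close to the value_counts idea is below; the toy data and the name EXPECTED_ROWS are assumptions for illustration (a complete country has 4 rows here, not 240).

```python
import pandas as pd

# Hypothetical threshold for this toy example; use 240 for the real dataset.
EXPECTED_ROWS = 4

df = pd.DataFrame({
    "Area": ["Afghanistan"] * 4 + ["Bhutan"] * 3,
    "Year": [1961, 1961, 1962, 1962, 1961, 1961, 1962],
    "Value": [1000, 1100, 1304, 1923, 1236, 1836, 1111],
    "Crop": ["Corn", "Potato", "Potato", "Corn", "Corn", "Potato", "Corn"],
})

# Call value_counts() to get a Series of row counts indexed by country.
counts = df["Area"].value_counts()

# Map each row's country to its count, giving a mask aligned with df's rows.
mask = df["Area"].map(counts) == EXPECTED_ROWS

df_complete = df[mask].reset_index(drop=True)
print(df_complete)
```

An equivalent option is df[df["Area"].isin(counts[counts == EXPECTED_ROWS].index)], which first picks the complete countries and then selects their rows.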