Wow to filter elements in pandas group within a range-CodePudding

I have the following pandas dataframe:

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
                   'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
                   'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})

   id  x  y  year
0   1  A  B  1990
1   2  A  B  1991
2   3  A  B  1992
3   4  A  B  1993
4   5  C  D  1994
5   6  C  D  1999
6   7  E  F  1999
7   8  G  H  2001
8   9  G  H  2002
9  10  G  H  2010

for each groupby(['x', 'y']), I need only the count of the ids in which the year figures between the min year of the group and year 4

Expected result:

   x  y  count_id
0  A  B         4
1  C  D         1
2  E  F         1
3  G  H         2

CodePudding user response：

You can groupby and apply:

print (df.groupby(["x","y"])["year"].apply(lambda d: (d-d.min()).le(4).sum()))

x  y
A  B    4
C  D    1
E  F    1
G  H    2
Name: year, dtype: int64

CodePudding user response：

We can create a mask based on where the year is less than or equal to 4 years from the group minumum (groupby transform, then filter and groupby to count the number of ids:

cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].count()

Or with nunique if wanting only unique id count:

cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].nunique()

new_df:

   x  y  id
0  A  B   4
1  C  D   1
2  E  F   1
3  G  H   2

CodePudding user response：

Using .groupby Series.between:

x = (
    df.groupby(["x", "y"])
    .apply(lambda x: x["year"].between((mn := x["year"].min()), mn   4).sum())
    .reset_index(name="count_id")
)
print(x)

Prints:

   x  y  count_id
0  A  B         4
1  C  D         1
2  E  F         1
3  G  H         2