I have the following pandas dataframe:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})
id x y year
0 1 A B 1990
1 2 A B 1991
2 3 A B 1992
3 4 A B 1993
4 5 C D 1994
5 6 C D 1999
6 7 E F 1999
7 8 G H 2001
8 9 G H 2002
9 10 G H 2010
for each groupby(['x', 'y'])
, I need only the count of the ids in which the year figures between the min year of the group and year 4
Expected result:
x y count_id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
CodePudding user response:
You can groupby
and apply
:
print (df.groupby(["x","y"])["year"].apply(lambda d: (d-d.min()).le(4).sum()))
x y
A B 4
C D 1
E F 1
G H 2
Name: year, dtype: int64
CodePudding user response:
We can create a mask based on where the year is less than or equal to 4 years from the group minumum (groupby transform
, then filter and groupby to count
the number of ids:
cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].count()
Or with nunique
if wanting only unique id count:
cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].nunique()
new_df
:
x y id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
CodePudding user response:
Using .groupby
Series.between
:
x = (
df.groupby(["x", "y"])
.apply(lambda x: x["year"].between((mn := x["year"].min()), mn 4).sum())
.reset_index(name="count_id")
)
print(x)
Prints:
x y count_id
0 A B 4
1 C D 1
2 E F 1
3 G H 2