I have a pandas df of the following format
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
BBB 2022 1 1 5
BBB 2022 1 2 10
BBB 2022 2 1 10
BBB 2022 2 2 15
What I am looking to do is to filter this df
such that I am grouping by STOCK
and YR
and MONTH
and selecting the groups
with 3 or more entries.
So the resulting df
looks like
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
Note that BBB
is eliminated as it had only 2 rows in each group
, when grouped by STOCK
, YR
and MONTH
I have tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x.STOCK.nunique() > 5)
but this resulted in an empty
frame.
Also tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x['STOCK','YR','MONTH'].nunique() > 5)
but this resulted in a KeyError: ('STOCK', 'YR', 'MONTH')
Thanks!
CodePudding user response:
Use GroupBy.transform('count')
:
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('count').ge(3)]
or 'size':
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').ge(3)]
output:
STOCK YR MONTH DAY PRICE
0 AAA 2022 1 1 10
1 AAA 2022 1 2 11
2 AAA 2022 1 3 10
3 AAA 2022 1 4 15
4 AAA 2022 1 5 10
CodePudding user response:
Use GroupBy.transform
:
If need counts (not exclude possible NaN
s):
#if need test number of unique values
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').gt(3)]
Or:
#in large df should be slow
df.groupby(['STOCK','YR','MONTH']).filter(lambda x: len(x) > 3)