I have a df like that:
month | stock | MV |
---|---|---|
1994-07 | A | 50 |
1994-07 | B | 60 |
1994-07 | C | 70 |
1994-07 | D | 80 |
1994-08 | A | 90 |
1994-08 | B | 60 |
1994-08 | C | 70 |
1994-08 | D | 95 |
1994-08 | E | 100 |
1994-08 | F | 110 |
I would like to subset my df in a way that I only have in it the 50% of the highest MV per month. For July/1994 I only have 4 stock, so 50% will be the 2 highest MV. For the month after, I have 6 stocks, which gives me 3 highest values:
month | stock | MV |
---|---|---|
1994-07 | C | 70 |
1994-07 | D | 80 |
1994-08 | D | 95 |
1994-08 | E | 100 |
1994-08 | F | 110 |
I have tried:
df = df.groupby(pd.Grouper(freq="M")).nlargest(2, "MV")
But I got the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'nlargest' In addition, the value of n will need to be a different value for every month. I am not sure how to handle that as well.
CodePudding user response:
You need to pass apply
df = df.groupby(pd.Grouper(freq="M")).apply(lambda x : x.nlargest(2, "MV"))
CodePudding user response:
df.groupby('month').apply(lambda monthly_data: monthly_data[monthly_data['MV'] >= monthly_data['MV'].median())
CodePudding user response:
You need to use apply
and compute the final length:
df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nlargest(len(g)//2, 'MV'))
with the provided data as string:
(df.groupby('month', as_index=False, group_keys=False) # use pd.Grouper if needed here
.apply(lambda g: g.nlargest(len(g)//2, 'MV'))
.sort_values(by=['month', 'MV'])
)
output:
month stock MV
2 1994-07 C 70
3 1994-07 D 80
7 1994-08 D 95
8 1994-08 E 100
9 1994-08 F 110
CodePudding user response:
You can sorting values and then select 50% of length of values in lambda function:
df = (df.sort_values(['month','MV'])
.groupby(pd.Grouper(freq="M"), group_keys=False)
.apply(lambda x: x.head(int(len(x)) // 2)))