Home > database >  Subset dataframe based on large values of a column per month
Subset dataframe based on large values of a column per month

Time:03-14

I have a df like that:

month stock MV
1994-07 A 50
1994-07 B 60
1994-07 C 70
1994-07 D 80
1994-08 A 90
1994-08 B 60
1994-08 C 70
1994-08 D 95
1994-08 E 100
1994-08 F 110

I would like to subset my df in a way that I only have in it the 50% of the highest MV per month. For July/1994 I only have 4 stock, so 50% will be the 2 highest MV. For the month after, I have 6 stocks, which gives me 3 highest values:

month stock MV
1994-07 C 70
1994-07 D 80
1994-08 D 95
1994-08 E 100
1994-08 F 110

I have tried:

df = df.groupby(pd.Grouper(freq="M")).nlargest(2, "MV")

But I got the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'nlargest' In addition, the value of n will need to be a different value for every month. I am not sure how to handle that as well.

CodePudding user response:

You need to pass apply

df = df.groupby(pd.Grouper(freq="M")).apply(lambda x : x.nlargest(2, "MV"))

CodePudding user response:

df.groupby('month').apply(lambda monthly_data: monthly_data[monthly_data['MV'] >= monthly_data['MV'].median())

CodePudding user response:

You need to use apply and compute the final length:

df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nlargest(len(g)//2, 'MV'))

with the provided data as string:

(df.groupby('month', as_index=False, group_keys=False) # use pd.Grouper if needed here
   .apply(lambda g: g.nlargest(len(g)//2, 'MV'))
   .sort_values(by=['month', 'MV'])
)

output:

     month stock   MV
2  1994-07     C   70
3  1994-07     D   80
7  1994-08     D   95
8  1994-08     E  100
9  1994-08     F  110

CodePudding user response:

You can sorting values and then select 50% of length of values in lambda function:

df = (df.sort_values(['month','MV'])
        .groupby(pd.Grouper(freq="M"), group_keys=False)
        .apply(lambda x: x.head(int(len(x)) // 2)))
  • Related