I have a df and, for each month, I want to keep a list of the 50% highest values of "B" in that month.
month | A | B |
---|---|---|
1994-07 | A | 50 |
1994-07 | B | 60 |
1994-07 | C | 70 |
1994-07 | D | 80 |
1994-07 | E | NaN |
1994-07 | F | NaN |
1994-08 | A | 90 |
1994-08 | B | 60 |
1994-08 | C | 70 |
1994-08 | D | 95 |
1994-08 | E | 100 |
1994-08 | F | 110 |
1994-08 | G | NaN |
For July/1994 only 4 rows have "B" filled, so 50% means the 2 highest values. For the month after, I have 6 filled values, which gives me the 3 highest:
month | A | B |
---|---|---|
1994-07 | C | 70 |
1994-07 | D | 80 |
1994-08 | D | 95 |
1994-08 | E | 100 |
1994-08 | F | 110 |
I have tried:
```python
df = df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nsmallest(len(g)//2, 'B'))
```
However, it does not ignore "NaN" and counts it as a value. For July/1994, for example, it counts 6 values and so returns the 3 highest (50% of 6). Instead, it should count the 4 non-NaN values and return the 2 highest.
CodePudding user response:
IIUC, you want to use `g['B'].count()`, as `count` ignores NaNs:
```python
(df
 .groupby('month')
 .apply(lambda g: g.nlargest(g['B'].count()//2, columns='B'))
 .droplevel(0).sort_index()
)
```
Alternatively, `dropna` first:
```python
(df
 .dropna(subset=['B'])  # a list is accepted by all pandas versions
 .groupby('month')
 .apply(lambda g: g.nlargest(len(g)//2, 'B'))
 .droplevel(0).sort_index()
)
```
Output:
```
      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0
```
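For reference, a self-contained sketch that rebuilds the sample frame and applies the count-based approach; it uses `group_keys=False` as an equivalent alternative to `.droplevel(0)`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "month": ["1994-07"] * 6 + ["1994-08"] * 7,
    "A": list("ABCDEF") + list("ABCDEFG"),
    "B": [50, 60, 70, 80, np.nan, np.nan,
          90, 60, 70, 95, 100, 110, np.nan],
})

# count() ignores NaN, so each month keeps the top 50% of its *filled* values
out = (df
       .groupby("month", group_keys=False)
       .apply(lambda g: g.nlargest(g["B"].count() // 2, columns="B"))
       .sort_index())
print(out)
```

July has 4 non-NaN values (keep 2), August has 6 (keep 3), matching the expected output above.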
CodePudding user response:
Alternatively, you could pass `median` to `groupby.transform`, then keep values strictly greater than the median (i.e. the top 50%). Since the `median` method skips NaN by default, there are no issues there.
```python
out = df[df['B'] > df.groupby('month')['B'].transform('median')]
```
Output:
```
      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0
```