Pandas Top N% of values within each group while ignoring missing values


I have a df and I want, for each month, a list of the top 50% of the values of "B" in that month.

month A B
1994-07 A 50
1994-07 B 60
1994-07 C 70
1994-07 D 80
1994-07 E NaN
1994-07 F NaN
1994-08 A 90
1994-08 B 60
1994-08 C 70
1994-08 D 95
1994-08 E 100
1994-08 F 110
1994-08 G NaN

For July/1994 only 4 rows have "B" filled with values, so 50% will be the 2 highest values. For the month after, I have 6 stocks, which gives me the 3 highest values:

month A B
1994-07 C 70
1994-07 D 80
1994-08 D 95
1994-08 E 100
1994-08 F 110

I have tried:

df = df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nlargest(len(g)//2, 'B'))

However, it does not ignore "NaN" and counts it as a value. For example, for July/1994 it counts 6 values and so returns the 3 (50% of 6) highest values for that month. Instead, it should count that there are 4 values and return the 2 highest.

CodePudding user response:

IIUC, you want to use g['B'].count(), as count ignores NaNs:

(df
 .groupby('month')
 .apply(lambda g: g.nlargest(g['B'].count()//2, columns='B'))
 .droplevel(0).sort_index()
)

Alternatively, dropna first:

(df
 .dropna(subset='B')
 .groupby('month')
 .apply(lambda g: g.nlargest(len(g)//2, 'B'))
 .droplevel(0).sort_index()
)

Output:

      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0
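For reference, here is a self-contained sketch of the first approach, with the sample frame rebuilt from the data in the question (column names and values as given there):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "month": ["1994-07"] * 6 + ["1994-08"] * 7,
    "A": list("ABCDEF") + list("ABCDEFG"),
    "B": [50, 60, 70, 80, np.nan, np.nan,
          90, 60, 70, 95, 100, 110, np.nan],
})

out = (df
       .groupby("month")
       # count() skips NaN, so n is 50% of the *non-missing* values
       .apply(lambda g: g.nlargest(g["B"].count() // 2, columns="B"))
       .droplevel(0)      # drop the added "month" index level
       .sort_index())     # restore the original row order
print(out)
```

For July the count is 4 (not 6), so two rows are kept; for August the count is 6 and three rows are kept.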

CodePudding user response:

Alternatively, you could pass 'median' to groupby.transform, then keep the values greater than the median (i.e. the top 50%). Since the median method skips NaN by default, missing values are handled automatically.

out = df[df['B'] > df.groupby('month')['B'].transform('median')]

Output:

      month  A      B
2   1994-07  C   70.0
3   1994-07  D   80.0
9   1994-08  D   95.0
10  1994-08  E  100.0
11  1994-08  F  110.0