Home > OS >  min/max value of a column based on values of another column, grouped by and transformed in pandas
min/max value of a column based on values of another column, grouped by and transformed in pandas

Time:08-25

I'd like to know if I can do all this in one line, rather than multiple lines.

my dataframe:

    import pandas as pd
df = pd.DataFrame({'ID' : [1,1,1,1,1,1,2,2,2,2,2,2]
    ,'A': [1, 2, 3, 10, np.nan, 5 , 20, 6, 7, np.nan, np.nan, np.nan]
    , 'B': [0,1,1,0,1,1,1,1,1,0,1,0]
    , 'desired_output' : [5,5,5,5,5,5,20,20,20,20,20,20]})
df

    ID  A       B   desired_output
0   1   1.0     0   5
1   1   2.0     1   5
2   1   3.0     1   5
3   1   10.0    0   5
4   1   NaN     1   5
5   1   5.0     1   5
6   2   20.0    1   20
7   2   6.0     1   20
8   2   7.0     1   20
9   2   NaN     0   20
10  2   NaN     1   20
11  2   NaN     0   20

I'm trying to find the maximum value of column A, for values of column B == 1, group by column ID, and transform the results directly so that the value is back in the dataframe without extra merging et al.

something like the following (but without getting errors!)

df['desired_output'] =  df.groupby('ID').A.where(df.B == 1).transform('max')  ## this gives error

The max function should ignore the NaNs as well. I wonder if I'm trying too much in one line, but one can hope there is a way for a beautiful code.

EDIT: I can get a very similar output by changing the where clause:

df['desired_output'] =  df.where(df.B == 1).groupby('ID').A.transform('max') ## this works but output is not what i want

but the output is not exactly what I want. desired_output should not have any NaN, unless all values of A are NaN for when B == 1.

CodePudding user response:

Here is a way to do it:

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'ID' : [1,1,1,1,1,1,2,2,2,2,2,2],
        'A': [1, 2, 3, 10, np.nan, 5 , 20, 6, 7, np.nan, np.nan, np.nan], 
        'B': [0,1,1,0,1,1,1,1,1,0,1,0], 
        'desired_output' : [5,5,5,5,5,5,20,20,20,20,20,20]
    })

df['output'] = df[df.B == 1].groupby('ID').A.max()[df.ID].reset_index().A

df

Result:


    ID     A  B  desired_output  output
0    1   1.0  0               5     5.0
1    1   2.0  1               5     5.0
2    1   3.0  1               5     5.0
3    1  10.0  0               5     5.0
4    1   NaN  1               5     5.0
5    1   5.0  1               5     5.0
6    2  20.0  1              20    20.0
7    2   6.0  1              20    20.0
8    2   7.0  1              20    20.0
9    2   NaN  0              20    20.0
10   2   NaN  1              20    20.0
11   2   NaN  0              20    20.0

Decomposition:

df[df.B == 1]   # start by filtering on B
.groupby('ID')  # group by ID
.A.max()        # get max values in column A
[df.ID]         # recast the result on ID series shape
.reset_index()  # re-create the index (ID was the index after groupby)
.A              # return values of A (which are now max values of original A)

Important note: it relies on the fact that the index is as in the given example, that is, sorted, starting from 0, with a 1 increment. You will have to reset_index() of your DataFrame before this operation when this is not the case.

  • Related