How to create 2 new columns in a DataFrame based on the highest values in the rest of the columns with appropria

Time:01-20

I have a Pandas DataFrame like the one below (my real DataFrame is much bigger, so I need to do the aggregation below only for selected columns):

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 60        | 50

Requirements:

  • I need to create a new column "TOP_COUNT" holding the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID.

    • If an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value.
  • I need to create a new column "TOP_SUM" holding the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID.

    • If an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value.
  • If both columns with the COUNT_ prefix are 0, TOP_COUNT should be NaN.

  • If both columns with the SUM_ prefix are 0, TOP_SUM should be NaN.

Desired output:

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B  | TOP_COUNT   | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111  | 10          | 10          | 320       | 120        | COUNT_COL_A | SUM_COL_A 
222  | 15          | 80          | 500       | 500        | COUNT_COL_B | SUM_COL_B  
333  | 0           | 0           | 110       | 350        | NaN         | SUM_COL_B  
444  | 20          | 5           | 0         | 0          | COUNT_COL_A | NaN
555  | 0           | 0           | 0         | 0          | NaN         | NaN
666  | 10          | 20          | 60        | 50         | COUNT_COL_B | SUM_COL_A

How can I do that in Python Pandas?

CodePudding user response:

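You can use the idxmax function as follows. Note that idxmax returns the first column when the values tie, so this covers the highest-value and all-zero rules but not the tie-breaking rule: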

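import pandas as pd

# Name of the column holding the row-wise maximum (the first column wins on ties)
df['TOP_COUNT'] = df[['COUNT_COL_A', 'COUNT_COL_B']].idxmax(axis="columns")
df['TOP_SUM'] = df[['SUM_COL_A', 'SUM_COL_B']].idxmax(axis="columns")

# Rows where both columns are 0 get a missing value instead
df.loc[(df[['COUNT_COL_A', 'COUNT_COL_B']] == 0).all(axis=1), 'TOP_COUNT'] = pd.NA
df.loc[(df[['SUM_COL_A', 'SUM_COL_B']] == 0).all(axis=1), 'TOP_SUM'] = pd.NA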
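If the tie-breaking rule from the question matters, one possible approach is to compare the two columns directly instead of relying on idxmax. The following is only a sketch: `top_column` is a hypothetical helper name, the sample values follow the desired-output table in the question (SUM_COL_A = 60 for ID 666), and it assumes that when both the compared columns and their counterparts tie on a non-zero value, the first column (COL_A) is chosen, a case the question does not specify.

```python
import numpy as np
import pandas as pd

# Sample data; values follow the desired-output table in the question.
df = pd.DataFrame({
    "ID": [111, 222, 333, 444, 555, 666],
    "COUNT_COL_A": [10, 15, 0, 20, 0, 10],
    "COUNT_COL_B": [10, 80, 0, 5, 0, 20],
    "SUM_COL_A": [320, 500, 110, 0, 0, 60],
    "SUM_COL_B": [120, 500, 350, 0, 0, 50],
})


def top_column(df, prefix, tie_prefix, suffixes=("COL_A", "COL_B")):
    """Hypothetical helper: per row, name of the larger `prefix` column."""
    a, b = (f"{prefix}_{s}" for s in suffixes)
    tie_a, tie_b = (f"{tie_prefix}_{s}" for s in suffixes)

    # B wins when it is strictly larger, or on a tie when its counterpart
    # column is strictly larger. Otherwise A is chosen (assumption: A also
    # wins when the counterparts tie as well).
    b_wins = (df[b] > df[a]) | ((df[b] == df[a]) & (df[tie_b] > df[tie_a]))
    top = pd.Series(np.where(b_wins, b, a), index=df.index, dtype="object")

    # Rows where both compared columns are 0 become NaN.
    both_zero = (df[a] == 0) & (df[b] == 0)
    return top.mask(both_zero)


df["TOP_COUNT"] = top_column(df, "COUNT", "SUM")
df["TOP_SUM"] = top_column(df, "SUM", "COUNT")
print(df[["ID", "TOP_COUNT", "TOP_SUM"]])
```

On the sample data this gives SUM_COL_B for ID 222 (the SUM values tie at 500 and COUNT_COL_B is higher), whereas plain idxmax would return SUM_COL_A there. To restrict the logic to other selected columns in a larger DataFrame, the prefixes and suffixes passed to the helper can be changed.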