I have a Pandas DataFrame like the one below (my real DataFrame is much bigger, so I need to do the aggregation below only for selected columns):
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 0 | 0 | 110 | 350
444 | 20 | 5 | 0 | 0
555 | 0 | 0 | 0 | 0
666 | 10 | 20 | 60 | 50
Requirements:
I need to create a new column "TOP_COUNT" holding the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID,
- if an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value
I need to create a new column "TOP_SUM" holding the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID,
- if an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value
If both columns with the COUNT_ prefix are 0, TOP_COUNT should be NaN
If both columns with the SUM_ prefix are 0, TOP_SUM should be NaN
Desired output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 0 | 0 | 110 | 350 | NaN | SUM_COL_B
444 | 20 | 5 | 0 | 0 | COUNT_COL_A | NaN
555 | 0 | 0 | 0 | 0 | NaN | NaN
666 | 10 | 20 | 60 | 50 | COUNT_COL_B | SUM_COL_A
How can I do that in Python Pandas?
CodePudding user response:
You can use the idxmax function as follows:
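import pandas as pd
# Name of the column holding the row-wise maximum (on a tie, idxmax returns the first column)
df['TOP_COUNT'] = df[['COUNT_COL_A', 'COUNT_COL_B']].idxmax(axis="columns")
df['TOP_SUM'] = df[['SUM_COL_A', 'SUM_COL_B']].idxmax(axis="columns")
# Rows where both columns are 0 get a missing value instead of a column name
df.loc[(df[['COUNT_COL_A', 'COUNT_COL_B']] == 0).all(axis=1), 'TOP_COUNT'] = pd.NA
df.loc[(df[['SUM_COL_A', 'SUM_COL_B']] == 0).all(axis=1), 'TOP_SUM'] = pd.NA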
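
Note that idxmax on its own always picks the first column when the two values are equal, so it does not apply the tie-break rule from the question (ID 222 would get SUM_COL_A instead of SUM_COL_B). Below is a minimal sketch of one way to cover the tie-break as well. The helper name top_column and its parameters are made up for illustration, the sample data follows the desired-output table (SUM_COL_A = 60 for ID 666), and the choice of the "A" column when both the values and their counterparts are tied is an assumption, since the question does not cover that case.

```python
import numpy as np
import pandas as pd

# Sample data taken from the desired-output table in the question
df = pd.DataFrame({
    "ID": [111, 222, 333, 444, 555, 666],
    "COUNT_COL_A": [10, 15, 0, 20, 0, 10],
    "COUNT_COL_B": [10, 80, 0, 5, 0, 20],
    "SUM_COL_A": [320, 500, 110, 0, 0, 60],
    "SUM_COL_B": [120, 500, 350, 0, 0, 50],
})


def top_column(df, prefix, tie_prefix, suffixes=("COL_A", "COL_B")):
    """Hypothetical helper: name of the larger of two prefixed columns per row.

    Ties are broken by the counterpart columns (tie_prefix); rows where both
    values are 0 become NaN.
    """
    a, b = (f"{prefix}_{s}" for s in suffixes)
    tie_a, tie_b = (f"{tie_prefix}_{s}" for s in suffixes)

    both_zero = (df[a] == 0) & (df[b] == 0)
    tied = df[a] == df[b]
    # A wins outright, or on a tie when its counterpart is at least as large.
    # Assumption: if the counterparts are tied too, A is chosen.
    a_wins = (df[a] > df[b]) | (tied & (df[tie_a] >= df[tie_b]))

    result = pd.Series(np.where(a_wins, a, b), index=df.index, dtype="object")
    return result.mask(both_zero)  # NaN where both values are 0


df["TOP_COUNT"] = top_column(df, "COUNT", "SUM")
df["TOP_SUM"] = top_column(df, "SUM", "COUNT")
print(df)
```

Because the helper only takes the two prefixes, it touches just the four selected columns and leaves the rest of a wider DataFrame alone.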