I have Pandas DataFrame like below (I can add that my DataFrame is definitely bigger, so I need to do below aggregation only for selected columns):
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 0 | 0 | 110 | 350
444 | 20 | 5 | 0 | 0
555 | 0 | 0 | 0 | 0
666 | 10 | 20 | 60 | 50
Requirements:
I need to create new column "TOP_COUNT_2" where will be name of column (COUNT_COL_A or COUNT_COL_B) with the highest value per each ID,
- if some ID has same values in all "COUNT_" columns take to "TOP_COUNT_2" all columns names with prefix "COUNT_" mentioned after the decimal point
I need to create new column "TOP_SUM_2" where will be name of column (SUM_COL_A or SUM_COL_B) with the highest value per each ID,
- if some ID has same values in all "SUM_" columns take to "TOP_SUM_2" all columns names with prefix "COUNT_" mentioned after the decimal point
If there is 0 in both columns with prefix COUNT_ then give NaN in column TOP_COUNT
If there is 0 in both columns with prefix SUM_ then give NaN in column TOP_SUM
Desire output:
ID | CONT_COL_A | CNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT_2 | TOP_SUM_2
-----|-------------|-------------|-----------|------------|----------------------|-----------
111 | 10 | 10 | 320 | 120 | CNT_COL_A, CNT_COL_B | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_A, SUM_COL_B
333 | 0 | 0 | 110 | 350 | NaN | SUM_COL_B
444 | 20 | 5 | 0 | 0 | COUNT_COL_A | NaN
555 | 0 | 0 | 0 | 0 | NaN | NaN
666 | 10 | 20 | 60 | 50 | COUNT_COL_B | SUM_COL_A
How can i do that in Python Pandas ?
CodePudding user response:
First create mask for processing only non 0
only rows by DataFrame.any
with Series.ne
with boolean indexing
and DataFrame.loc
, then compare by maximal values and for join column names use DataFrame.dot
trick with columns names with separator:
cols1 = ['COUNT_COL_A' , 'COUNT_COL_B']
cols2 = ['SUM_COL_A','SUM_COL_B']
m1 = df[cols1].ne(0).any(axis=1)
m2 = df[cols2].ne(0).any(axis=1)
df1 = df.loc[m1, cols1]
df2 = df.loc[m2, cols2]
df['TOP_COUNT_2'] = df1.eq(df1.max(axis=1), axis=0).dot(df1.columns ', ').str[:-2]
df['TOP_SUM_2'] = df2.eq(df2.max(axis=1), axis=0).dot(df2.columns ', ').str[:-2]
print (df)
ID COUNT_COL_A COUNT_COL_B SUM_COL_A SUM_COL_B \
0 111 10 10 320 120
1 222 15 80 500 500
2 333 0 0 110 350
3 444 20 5 0 0
4 555 0 0 0 0
5 666 10 20 60 50
TOP_COUNT_2 TOP_SUM_2
0 COUNT_COL_A, COUNT_COL_B SUM_COL_A
1 COUNT_COL_B SUM_COL_A, SUM_COL_B
2 NaN SUM_COL_B
3 COUNT_COL_A NaN
4 NaN NaN
5 COUNT_COL_B SUM_COL_A