Home > Mobile >  How to create new columns with name of columns in list with the highest value per ID, metioned after
How to create new columns with name of columns in list with the highest value per ID, metioned after

Time:01-20

I have Pandas DataFrame like below (I can add that my DataFrame is definitely bigger, so I need to do below aggregation only for selected columns):

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 60        | 50

Requirements:

  • I need to create new column "TOP_COUNT_2" where will be name of column (COUNT_COL_A or COUNT_COL_B) with the highest value per each ID,

    • if some ID has same values in all "COUNT_" columns take to "TOP_COUNT_2" all columns names with prefix "COUNT_" mentioned after the decimal point
  • I need to create new column "TOP_SUM_2" where will be name of column (SUM_COL_A or SUM_COL_B) with the highest value per each ID,

    • if some ID has same values in all "SUM_" columns take to "TOP_SUM_2" all columns names with prefix "COUNT_" mentioned after the decimal point
  • If there is 0 in both columns with prefix COUNT_ then give NaN in column TOP_COUNT

  • If there is 0 in both columns with prefix SUM_ then give NaN in column TOP_SUM

Desire output:

ID   | CONT_COL_A  | CNT_COL_B   | SUM_COL_A | SUM_COL_B  | TOP_COUNT_2          | TOP_SUM_2   
-----|-------------|-------------|-----------|------------|----------------------|-----------
111  | 10          | 10          | 320       | 120        | CNT_COL_A, CNT_COL_B | SUM_COL_A
222  | 15          | 80          | 500       | 500        | COUNT_COL_B          | SUM_COL_A, SUM_COL_B
333  | 0           | 0           | 110       | 350        | NaN                  | SUM_COL_B
444  | 20          | 5           | 0         | 0          | COUNT_COL_A          | NaN
555  | 0           | 0           | 0         | 0          | NaN                  | NaN
666  | 10          | 20          | 60        | 50         | COUNT_COL_B          | SUM_COL_A

How can i do that in Python Pandas ?

CodePudding user response:

First create mask for processing only non 0 only rows by DataFrame.any with Series.ne with boolean indexing and DataFrame.loc, then compare by maximal values and for join column names use DataFrame.dot trick with columns names with separator:

cols1 = ['COUNT_COL_A' , 'COUNT_COL_B']
cols2 = ['SUM_COL_A','SUM_COL_B']

m1 = df[cols1].ne(0).any(axis=1)
m2 = df[cols2].ne(0).any(axis=1)

df1 = df.loc[m1, cols1]
df2 = df.loc[m2, cols2]

df['TOP_COUNT_2'] = df1.eq(df1.max(axis=1), axis=0).dot(df1.columns   ', ').str[:-2]
df['TOP_SUM_2'] = df2.eq(df2.max(axis=1), axis=0).dot(df2.columns   ', ').str[:-2]
print (df)
    ID  COUNT_COL_A  COUNT_COL_B  SUM_COL_A  SUM_COL_B  \
0  111           10           10        320        120   
1  222           15           80        500        500   
2  333            0            0        110        350   
3  444           20            5          0          0   
4  555            0            0          0          0   
5  666           10           20         60         50   

                TOP_COUNT_2             TOP_SUM_2  
0  COUNT_COL_A, COUNT_COL_B             SUM_COL_A  
1               COUNT_COL_B  SUM_COL_A, SUM_COL_B  
2                       NaN             SUM_COL_B  
3               COUNT_COL_A                   NaN  
4                       NaN                   NaN  
5               COUNT_COL_B             SUM_COL_A  
  • Related