Home > Back-end >  Pandas: New column value based on the matching multi-level column's conditions
Pandas: New column value based on the matching multi-level column's conditions

Time:10-06

I have the following dataframe with multi-level columns

In [1]: data = {('A', '10'):[1,3,0,1],
                ('A', '20'):[3,2,0,0],
                ('A', '30'):[0,0,3,0],
                ('B', '10'):[3,0,0,0],
                ('B', '20'):[0,5,0,0],
                ('B', '30'):[0,0,1,0],
                ('C', '10'):[0,0,0,2],
                ('C', '20'):[1,0,0,0],
                ('C', '30'):[0,0,0,0]
                }
        df = pd.DataFrame(data)
        df
Out[1]:
   A        B        C
  10 20 30 10 20 30 10 20 30
0  1  3  0  3  0  0  0  1  0
1  3  2  0  0  5  0  0  0  0
2  0  0  3  0  0  1  0  0  0
3  1  0  0  0  0  0  2  0  0

In a new column results I want to return the combined column name containing the maximum value for each subset (i.e. second level column)

My desired output should look like the below

Out[2]:
   A        B        C
  10 20 30 10 20 30 10 20 30      results
0  1  3  0  3  0  0  0  1  0  A20&B10&C20
1  3  2  0  0  5  0  0  0  0      A10&B20
2  0  0  3  0  0  1  0  0  0      A30&B30
3  1  0  0  0  0  0  2  0  0      A10&C10

For example the first row:

For column 'A' the max value is under column '20' & for column 'B' there is only 1 value under '10' & for column 'C' also it is only one value under '20' & so the result would be A20&B10&C20

Edit: replacing " " with "&" in the results column, apparently I was misunderstood and you guys thought I need the summation while I need to column names separated by a separator

CodePudding user response:

Try:

df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)

Prints:

   A        B        C       results
  10 20 30 10 20 30 10 20 30        
0  1  3  0  3  0  0  0  1  0       7
1  3  2  0  0  5  0  0  0  0       8
2  0  0  3  0  0  1  0  0  0       4
3  1  0  0  0  0  0  2  0  0       3

CodePudding user response:

  • Group by level 0 and axis=1

  • You use idxmax to get max sub-level indexes as tuples (while skipping NaNs).

  • Apply function to rows (axix-1) to concat names

  • In function (that you apply to rows), Iterate on keys/columns and concatenate the column levels. Replace Nan (which have type 'float') with an empty string and filter them later.

You won't need df.replace(0, np.nan) if you initially have NaN and let them remain.

map_res = lambda x:  ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0] x[a][1]) for a in x.keys()])))

df['results'] = df.replace(0, np.nan)\
                  .groupby(level=0, axis=1)\
                  .idxmax(skipna = True)\
                  .apply(map_res,axis=1)

Here's output

    A       B           C               results
10  20  30  10  20  30  10  20  30  
0   1   3   0   3   0   0   0   1   0   A20,B10,C20
1   3   2   0   0   5   0   0   0   0   A10,B20
2   0   0   3   0   0   1   0   0   0   A30,B30
3   1   0   0   0   0   0   2   0   0   A10,C10
  • Related