I have the following dataframe with multi-level columns
In [1]: data = {('A', '10'):[1,3,0,1],
('A', '20'):[3,2,0,0],
('A', '30'):[0,0,3,0],
('B', '10'):[3,0,0,0],
('B', '20'):[0,5,0,0],
('B', '30'):[0,0,1,0],
('C', '10'):[0,0,0,2],
('C', '20'):[1,0,0,0],
('C', '30'):[0,0,0,0]
}
df = pd.DataFrame(data)
df
Out[1]:
A B C
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0
1 3 2 0 0 5 0 0 0 0
2 0 0 3 0 0 1 0 0 0
3 1 0 0 0 0 0 2 0 0
In a new column results
I want to return the combined column name containing the maximum value for each subset (i.e. second level column)
My desired output should look like the below
Out[2]:
A B C
10 20 30 10 20 30 10 20 30 results
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
For example the first row:
For column 'A' the max value is under column '20' &
for column 'B' there is only 1 value under '10' &
for column 'C' also it is only one value under '20' &
so the result would be A20&B10&C20
Edit: replacing " " with "&" in the results
column, apparently I was misunderstood and you guys thought I need the summation while I need to column names separated by a separator
CodePudding user response:
Try:
df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)
Prints:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 7
1 3 2 0 0 5 0 0 0 0 8
2 0 0 3 0 0 1 0 0 0 4
3 1 0 0 0 0 0 2 0 0 3
CodePudding user response:
Group by level 0 and axis=1
You use idxmax to get max sub-level indexes as tuples (while skipping NaNs).
Apply function to rows (axix-1) to concat names
In function (that you apply to rows), Iterate on keys/columns and concatenate the column levels. Replace Nan (which have type 'float') with an empty string and filter them later.
You won't need df.replace(0, np.nan) if you initially have NaN and let them remain.
map_res = lambda x: ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0] x[a][1]) for a in x.keys()])))
df['results'] = df.replace(0, np.nan)\
.groupby(level=0, axis=1)\
.idxmax(skipna = True)\
.apply(map_res,axis=1)
Here's output
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20,B10,C20
1 3 2 0 0 5 0 0 0 0 A10,B20
2 0 0 3 0 0 1 0 0 0 A30,B30
3 1 0 0 0 0 0 2 0 0 A10,C10