Home > Back-end >  How to do aggregation based on 3 binary columns and range column to calculate percentage participati
How to do aggregation based on 3 binary columns and range column to calculate percentage participati

Time:10-15

I have DataFrame in Python Pandas like below:

ID U1 U2 U3 CP CH
111 1 1 0 10-20 1
222 1 0 1 10-20 1
333 0 1 0 20-30 0
444 0 1 1 40-50 0
555 1 0 0 10-20 0

And I need to create column with percent of '1' in column 'CH' per combination for: U1/U2/U3 and CP, so as a result i need something like below:

   | idx | idx     | CH_perc   | 
   | ----|---------|-----------|
   |U1   |CP_10_20 | 0.66      | 3 IDs have '1' in U1 and CP = 10-20, and 2 of them have '1' in CH, so 2/3 = 0.66  
   |U1   |CP_20_30 | 0         | 
   |U1   |CP_30_40 | 0         |
   |U1   |CP_40_50 | 0         | 
   |U2   |CP_10_20 | 1.00      | 1 ID have '1' in U2 and CP = 10-20, and have '1' in CH, so 1/1/ = 1.00
   |U2   |CP_20_30 | 0         | 
   |U2   |CP_30_40 | 0         | 
   |U2   |CP_40_50 | 0         | 
   |U3   |CP_10_20 | 1.00      | 1 ID have '1' in U3 and CP = 10-20, and have '1' in CH, so 1/1/ = 1.00
   |U3   |CP_20_30 | 0         | 
   |U3   |CP_30_40 | 0         | 
   |U3   |CP_40_50 | 0         | 

How can I od that in Python Pandas ?

CodePudding user response:

You can use a melt and groupby.sum based approach:

(df.drop(columns='ID')
   .melt(['CP', 'CH'], var_name='idx')
   # keep only CH where value is 1
   .assign(CH=lambda d: d['CH'].mul(d['value']))
   .groupby(['idx', 'CP'], as_index=False).sum()
   .assign(CH_perc=lambda d: d.pop('CH').div(d.pop('value')).fillna(0))
)

output:

  idx     CP   CH_perc
0  U1  10-20  0.666667
1  U1  20-30  0.000000
2  U1  40-50  0.000000
3  U2  10-20  1.000000
4  U2  20-30  0.000000
5  U2  40-50  0.000000
6  U3  10-20  1.000000
7  U3  20-30  0.000000
8  U3  40-50  0.000000
  • Related