Home > Software engineering >  Add sum columns back to pandas dataframe through chain style
Add sum columns back to pandas dataframe through chain style

Time:05-05

I am trying to add a list of column sums back to my pandas dataframe through assign(). But I am not so sure how to do it when there's more than one column. What's the best way to do it or any other way to do it in a chain style given I have had other steps before?

data2.assign(data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum'))
                   rate_water  rate_fire  rate_wood
id     date                                        
apple  2019-01-01   -0.500000  -0.500000   0.000000
orange 2019-01-01   -0.636364  -0.963636   3.000000
melon  2019-01-01   -0.333333   5.666667  27.888889
apple  2020-01-01   -0.100000   7.900000  76.000000
orange 2020-01-01    0.363636  -0.963636  26.500000
melon  2020-01-01    0.166667   6.166667  27.235043
apple  2021-01-01    0.328571  26.261702  84.220779
orange 2021-01-01    0.363636  28.036364  28.683673
melon  2021-01-01    0.611111  39.944444  27.679487

Reproducible:

from pandas import Timestamp
data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
                   rate_water  rate_fire  rate_wood  sum_water  sum_fire    sum_wood
id     date                                                                         
apple  2019-01-01   -0.500000  -0.500000   0.000000  -1.469697   4.20303   30.888889
orange 2019-01-01   -0.636364  -0.963636   3.000000  -1.469697   4.20303   30.888889
melon  2019-01-01   -0.333333   5.666667  27.888889  -1.469697   4.20303   30.888889
apple  2020-01-01   -0.100000   7.900000  76.000000   0.430303  13.10303  129.735043
orange 2020-01-01    0.363636  -0.963636  26.500000   0.430303  13.10303  129.735043
melon  2020-01-01    0.166667   6.166667  27.235043   0.430303  13.10303  129.735043
apple  2021-01-01    0.328571  26.261702  84.220779   1.303319  94.24251  140.583940
orange 2021-01-01    0.363636  28.036364  28.683673   1.303319  94.24251  140.583940
melon  2021-01-01    0.611111  39.944444  27.679487   1.303319  94.24251  140.583940

CodePudding user response:

Use dictionary comprehension for dict of Series and add it to dataFrame with unpack **:

from pandas import Timestamp
data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
data2.index.names=['id','date']

cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.assign(**{rate_name.replace('rate','sum'): 
                        data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum') 
                        for rate_name in cols})
print (data2)
                   rate_water  rate_fire  rate_wood  sum_water  sum_fire  \
id     date                                                                
apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962   
orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272   
melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234   
apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914   
orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543   
melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629   
apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661   
orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492   
melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847   

                   sum_wood  
id     date                  
apple  2019-01-01  0.000000  
orange 2019-01-01  0.097122  
melon  2019-01-01  0.902878  
apple  2020-01-01  0.585809  
orange 2020-01-01  0.204262  
melon  2020-01-01  0.209928  
apple  2021-01-01  0.599078  
orange 2021-01-01  0.204032  
melon  2021-01-01  0.196889  

Another way is processing all columns together:

cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.join(data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
                     .rename(columns=lambda x: x.replace('rate','sum')))

cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.assign(**data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
                       .rename(columns=lambda x: x.replace('rate','sum')))
print (data2)
                   rate_water  rate_fire  rate_wood  sum_water  sum_fire  \
id     date                                                                
apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962   
orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272   
melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234   
apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914   
orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543   
melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629   
apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661   
orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492   
melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847   

                   sum_wood  
id     date                  
apple  2019-01-01  0.000000  
orange 2019-01-01  0.097122  
melon  2019-01-01  0.902878  
apple  2020-01-01  0.585809  
orange 2020-01-01  0.204262  
melon  2020-01-01  0.209928  
apple  2021-01-01  0.599078  
orange 2021-01-01  0.204032  
melon  2021-01-01  0.196889  

CodePudding user response:

One option is with the assign method and unpacking:

 data2.assign(**data2
                .abs()
                .div(
                data2.groupby('date')
                .transform('sum'))
                .rename(columns = lambda df: df.removeprefix('rate_'))
                .add_prefix('sum_'))

                    rate_water  rate_fire  rate_wood  sum_water  sum_fire  sum_wood
id     date
apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962  0.000000
orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272  0.097122
melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234  0.902878
apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914  0.585809
orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543  0.204262
melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629  0.209928
apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661  0.599078
orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492  0.204032
melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847  0.196889

Another option would be to concatenate along axis=1:

 pd.concat([data2,
           data2.abs()
                .div(groupby('date')
                .transform('sum'))
                .rename(columns = lambda df: df.removeprefix('rate_'))
                .add_prefix('sum_')], 
           axis = 1)
Out[103]:
                    rate_water  rate_fire  rate_wood  sum_water  sum_fire  sum_wood
id     date
apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962  0.000000
orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272  0.097122
melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234  0.902878
apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914  0.585809
orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543  0.204262
melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629  0.209928
apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661  0.599078
orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492  0.204032
melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847  0.196889

I feel though that it would probably be cleaner to create temporary variables instead, especially if this is going to be in production code

  • Related