Home > database >  Is it possible to compute nested sums using pandas and groupby?
Is it possible to compute nested sums using pandas and groupby?

Time:11-09

Suppose I have a dataframe with two numeric columns and one categorical column:

dftest=pd.DataFrame({
    "tau":[1,2,5,1,2,5],
    "x"  :[4,5,6,7,8,9],
    "cat":list("aaabbb")
})

Output:

enter image description here

I can calculate the sum of squares of x for each cat easily:

dftest.groupby(["cat"]).apply(
    lambda s: pd.Series({
        "sum_x^2":(s["x"]**2).sum()
    })
)

But now my question: instead of Sum_i (x_i^2) how do I calculate Sum_{i,j} rho(tau_i, tau_j) * x_i * x_j, for some square matrix rho with the correct dimensions and row/column indices? As a concrete example, suppose I have the following rho:

rho = pd.DataFrame({
    1:[1, 0.9, 0.8],
    2:[0.9, 1, 0.7],
    5:[0.8, 0.7, 1]
}).T
rho = rho.rename(columns={0:1, 1:2, 2:5})

enter image description here

For category a, I would like to calculate:

rho[1,1] * x[1] * x[1]   rho[1,2] * x[1] * x[2]   rho[1,5] * x[1] * x[5]   ...
... rho[2,1] * x[2] * x[1]   ...

Or as per the example given:

1*4*4   0.9*4*5   0.8*4*6   ...
0.9*5*4   ...

Currently I'm doing this by looping through the groupby objects and looping over the index values in rho, but I'm wondering if I'm missing a more elegant way.

CodePudding user response:

First is pivoted dftest for columns by groups 1,2,5:

df = dftest.pivot('cat','tau','x')
print (df)
tau  1  2  5
cat         
a    4  5  6
b    7  8  9

Then is reshaped rho for MultiIndex Series by DataFrame.stack:

s = rho.stack()
print (s)
1  1    1.0
   2    0.9
   5    0.8
2  1    0.9
   2    1.0
   5    0.7
5  1    0.8
   2    0.7
   5    1.0
dtype: float64

Repeated values by DataFrame.reindex in different levels:

df1 = df.reindex(s.index, level=0, axis=1)
print (df1)
     1        2        5      
     1  2  5  1  2  5  1  2  5
cat                           
a    4  4  4  5  5  5  6  6  6
b    7  7  7  8  8  8  9  9  9

df2 = df.reindex(s.index, level=1, axis=1)
print (df2)
     1        2        5      
     1  2  5  1  2  5  1  2  5
cat                           
a    4  5  6  4  5  6  4  5  6
b    7  8  9  7  8  9  7  8  9

Last are multiple all together with sum:

out = df1.mul(df2).mul(s).sum(axis=1)
print (out)
cat
a    193.4
b    496.4
dtype: float64
  • Related