scale within groups with scikit-learn

Time:01-21

import pandas as pd
from sklearn.preprocessing import scale

df = pd.DataFrame({'x':['a','a','a','a','a','a','b','b','b','b','b','b'], 'y':[1,1,1,2,2,2,1,1,1,2,2,2], 'z':[1,2,3,3,4,6,4,2,4,6,4,3]})

I'm trying to scale column z within each combination of x and y. The expected result is column z2:

    x  y  z   z2
0   a  1  1  -1.22
1   a  1  2   0
2   a  1  3   1.22

3   a  2  3  -1.07
4   a  2  4  -0.27
5   a  2  6   1.34

6   b  1  4   0.71
7   b  1  2  -1.41
8   b  1  4   0.71

9   b  2  6   1.34
10  b  2  4  -0.27
11  b  2  3  -1.07

I tried

df['z2'] = df.groupby(['x','y'])['z'].apply(scale)

but this returns the error

TypeError: incompatible index of inserted column with frame index

Answer:

A possible solution is to use the groupby transform method (SeriesGroupBy.transform), which returns a result aligned with the original index of df:

df['z3'] = df.groupby(['x', 'y'])['z'].transform(scale)

Output:

    x  y  z    z2        z3
0   a  1  1 -1.22 -1.224745
1   a  1  2  0.00  0.000000
2   a  1  3  1.22  1.224745
3   a  2  3 -1.07 -1.069045
4   a  2  4 -0.27 -0.267261
5   a  2  6  1.34  1.336306
6   b  1  4  0.71  0.707107
7   b  1  2 -1.41 -1.414214
8   b  1  4  0.71  0.707107
9   b  2  6  1.34  1.336306
10  b  2  4 -0.27 -0.267261
11  b  2  3 -1.07 -1.069045
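For reference, scale standardizes each group to zero mean and unit variance using the population standard deviation (ddof=0). The following is a minimal sketch, assuming you want the same numbers with pandas alone; the helper name standardize is hypothetical and not part of pandas or scikit-learn.

```python
import pandas as pd

# Same data as the table in the question.
df = pd.DataFrame({
    'x': ['a'] * 6 + ['b'] * 6,
    'y': [1, 1, 1, 2, 2, 2] * 2,
    'z': [1, 2, 3, 3, 4, 6, 4, 2, 4, 6, 4, 3],
})

def standardize(s):
    # Hypothetical helper: subtract the group mean and divide by the
    # population standard deviation (ddof=0), as scale does by default.
    return (s - s.mean()) / s.std(ddof=0)

# transform calls the helper once per (x, y) group and returns a Series
# indexed like df, so it can be assigned directly as a new column.
df['z3'] = df.groupby(['x', 'y'])['z'].transform(standardize)
print(df)
```

Note that pandas' own std defaults to ddof=1, so the explicit ddof=0 is what keeps the values in line with scale.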

EXPLANATION OF THE ERROR YOU GOT

The instruction

df.groupby(['x','y'])['z'].apply(scale)

produces the following output:

x  y
a  1         [-1.224744871391589, 0.0, 1.224744871391589]
   2    [-1.0690449676496974, -0.2672612419124242, 1.3...
b  1    [0.7071067811865474, -1.4142135623730951, 0.70...
   2    [1.3363062095621223, -0.2672612419124242, -1.0...
Name: z, dtype: object

This is a Series with one array per group, indexed by a MultiIndex of the (x, y) group keys, whereas df has a plain integer index with one entry per row. Because the two indexes do not match, pandas cannot align the values when running

df['z2'] = df.groupby(['x','y'])['z'].apply(scale)

which causes the error.
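The index difference between apply and transform can be inspected directly. This is a rough sketch under one assumption: standardize is a hypothetical NumPy stand-in for scale (it also returns an array), used only so the snippet does not depend on scikit-learn.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a'] * 6 + ['b'] * 6,
    'y': [1, 1, 1, 2, 2, 2] * 2,
    'z': [1, 2, 3, 3, 4, 6, 4, 2, 4, 6, 4, 3],
})

def standardize(s):
    # Hypothetical stand-in for sklearn.preprocessing.scale:
    # takes a group and returns a NumPy array (np.std uses ddof=0).
    values = s.to_numpy(dtype=float)
    return (values - values.mean()) / values.std()

grouped = df.groupby(['x', 'y'])['z']

# apply: one array per group, indexed by the (x, y) group keys.
applied = grouped.apply(standardize)

# transform: one value per original row, indexed like df.
transformed = grouped.transform(standardize)

print(type(applied.index).__name__, len(applied))
print(transformed.index.equals(df.index), len(transformed))
```

Only the transform result shares its index with df, which is why it can be assigned as a column while the apply result cannot.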
