Home > database >  Group by scikit-learn transformation
Group by scikit-learn transformation

Time:09-11

I have a scikit-learn MinMaxScaler feature engineering transformation, it works fine, but I need to combine it with group by, that is the min and the max needs to be taken by a specific column.

This is the code that works without group by:

from sklearn.preprocessing import MinMaxScaler

mydict = [{'a': 1, 'b': 2 },
          {'a': 1, 'b': 8 },
          {'a': 1, 'b': 5 },
          {'a': 2, 'b': 6 },
          {'a': 2, 'b': 1 },
          {'a': 2, 'b': 5 }]

df = pd.DataFrame(mydict)

scaler = MinMaxScaler(feature_range=(100,200))
df['x'] = scaler.fit_transform(df[['b']])
print(df)
 
    a   b   x
0   1   2   114.285714
1   1   8   200.000000
2   1   5   157.142857
3   2   6   171.428571
4   2   1   100.000000
5   2   5   157.142857

Note that the min happens when a = 2 and the max when a = 1. I need to group by a therefore the result should look something like this:

    a   b   x
0   1   2   100.000000
1   1   8   200.000000
2   1   5   157.142857
3   2   6   200.000000
4   2   1   100.000000
5   2   4   152.442111

How to incorporate the group by to the code above?

CodePudding user response:

Lets define a custom transformation function for scaling the values:

def scale(s, scaler):
    s = s.to_numpy()[:, None]
    return scaler.fit_transform(s).squeeze()

df['x'] = df.groupby('a')['b'].transform(scale, scaler)

Alternatively, lets do a little bit of math to calculate min-max transformation from scratch and then apply feature range upscaling:

g = df.groupby('a')['b']
min_, max_ = g.transform('min'), g.transform('max')

df['x'] = (df['b'] - min_) / (max_ - min_) * (200 - 100)   100

Result

   a  b      x
0  1  2  100.0
1  1  8  200.0
2  1  5  150.0
3  2  6  200.0
4  2  1  100.0
5  2  5  180.0
  • Related