I have a scikit-learn MinMaxScaler feature engineering transformation, it works fine, but I need to combine it with group by, that is the min and the max needs to be taken by a specific column.
This is the code that works without group by:
from sklearn.preprocessing import MinMaxScaler
mydict = [{'a': 1, 'b': 2 },
{'a': 1, 'b': 8 },
{'a': 1, 'b': 5 },
{'a': 2, 'b': 6 },
{'a': 2, 'b': 1 },
{'a': 2, 'b': 5 }]
df = pd.DataFrame(mydict)
scaler = MinMaxScaler(feature_range=(100,200))
df['x'] = scaler.fit_transform(df[['b']])
print(df)
a b x
0 1 2 114.285714
1 1 8 200.000000
2 1 5 157.142857
3 2 6 171.428571
4 2 1 100.000000
5 2 5 157.142857
Note that the min happens when a = 2 and the max when a = 1. I need to group by a
therefore the result should look something like this:
a b x
0 1 2 100.000000
1 1 8 200.000000
2 1 5 157.142857
3 2 6 200.000000
4 2 1 100.000000
5 2 4 152.442111
How to incorporate the group by to the code above?
CodePudding user response:
Lets define a custom transformation function for scaling the values:
def scale(s, scaler):
s = s.to_numpy()[:, None]
return scaler.fit_transform(s).squeeze()
df['x'] = df.groupby('a')['b'].transform(scale, scaler)
Alternatively, lets do a little bit of math to calculate min-max transformation from scratch and then apply feature range upscaling:
g = df.groupby('a')['b']
min_, max_ = g.transform('min'), g.transform('max')
df['x'] = (df['b'] - min_) / (max_ - min_) * (200 - 100) 100
Result
a b x
0 1 2 100.0
1 1 8 200.0
2 1 5 150.0
3 2 6 200.0
4 2 1 100.0
5 2 5 180.0