I want to apply MinMaxScaler to a number of pandas DataFrame columns 'together', meaning that I want the scaler to be fitted on all the data in those columns at once, not separately on each column.
My DataFrame has 20 columns. I want to apply the scaler on 12 of the columns at the same time. I have already read this. But it does not solve my problem since it acts on each column separately.
CodePudding user response:
IIUC, you want the sklearn scaler to fit and transform multiple columns with the same criteria (in this case, a shared min and max). Here is one way you can do this:
- Save the initial shape of the selected columns, then reshape the 2D numpy array of those columns into a single-column array with reshape(-1,1), so the scaler sees all values as one feature.
- Next, fit your scaler on this single-column array and transform it.
- Finally, use the saved shape to reshape the result back into the n columns you need and assign them back.
The advantage of this approach is that it works with any of the sklearn scalers you need to use: MinMaxScaler, StandardScaler, etc.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']})
cols = ['A','B']
old_shape = dfTest[cols].shape #(5,2)
dfTest[cols] = scaler.fit_transform(dfTest[cols].to_numpy().reshape(-1,1)).reshape(old_shape)
print(dfTest)
          A         B      C
0  0.000000  0.884188    big
1  0.756853  0.926301  small
2  0.764303  0.956992    big
3  0.817143  0.995530  small
4  0.766885  1.000000  small
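If the same fitted scaler later has to be applied to new rows, the reshape trick can be reused with transform instead of fit_transform. The following is only a sketch: the helper name transform_jointly and the new_df values are made up for illustration and are not part of sklearn or of the question.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def transform_jointly(scaler, frame, cols):
    """Apply an already-fitted single-feature scaler to all of `cols` at once.

    Hypothetical helper: flattens the selected columns to one feature,
    transforms them, and restores the original shape.
    """
    block = frame[cols].to_numpy()
    scaled = scaler.transform(block.reshape(-1, 1)).reshape(block.shape)
    out = frame.copy()
    out[cols] = scaled
    return out


train_df = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                         'B': [103.02, 107.26, 110.35, 114.23, 114.68]})
# Illustrative new data, not from the question
new_df = pd.DataFrame({'A': [14.00, 64.34], 'B': [114.68, 64.34]})

cols = ['A', 'B']
# Fit once on the flattened training columns (global min 14.00, max 114.68)
scaler = MinMaxScaler().fit(train_df[cols].to_numpy().reshape(-1, 1))

new_scaled = transform_jointly(scaler, new_df, cols)
print(new_scaled)
```

Here 14.00 and 114.68 should map to roughly 0 and 1, and 64.34 (the midpoint of the training range) to roughly 0.5 in either column, since both columns share one min and max.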
CodePudding user response:
You can extract the "min" and "max" statistics from those columns as a whole and perform the scaling yourself:
# columns of interest
cols = [...]
# get the minimum and maximum values in that region
vals = df[cols].to_numpy()
min_val = vals.min()
max_val = vals.max()
# scale the region using them
df[cols] = df[cols].sub(min_val).div(max_val - min_val)
(sub is the method form of "-" and div is the method form of "/".)
Above, df is your training dataframe; to scale the testing dataframe, substitute it for df in the last line, e.g.,
test_df[cols] = test_df[cols].sub(min_val).div(max_val - min_val)
rather than computing a separate min/max from the test set, which would leak information from the test set.
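Putting the train/test steps together, a minimal sketch could look like the following. The function names fit_joint_minmax and apply_joint_minmax and the test_df values are hypothetical, chosen only to illustrate the idea; the training data is borrowed from the other answer.

```python
import pandas as pd


def fit_joint_minmax(frame, cols):
    """Return the single (min, max) pair taken over all of `cols` together."""
    vals = frame[cols].to_numpy()
    return vals.min(), vals.max()


def apply_joint_minmax(frame, cols, min_val, max_val):
    """Scale `cols` with statistics computed elsewhere (e.g. on the training set)."""
    out = frame.copy()
    out[cols] = out[cols].sub(min_val).div(max_val - min_val)
    return out


train_df = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                         'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                         'C': ['big', 'small', 'big', 'small', 'small']})
# Illustrative test data with values outside the training range
test_df = pd.DataFrame({'A': [10.00, 64.34],
                        'B': [114.68, 120.00],
                        'C': ['big', 'small']})

cols = ['A', 'B']
min_val, max_val = fit_joint_minmax(train_df, cols)  # training data only

train_scaled = apply_joint_minmax(train_df, cols, min_val, max_val)
test_scaled = apply_joint_minmax(test_df, cols, min_val, max_val)
print(test_scaled)
```

Note that test values below the training minimum or above the training maximum end up outside [0, 1]; that is expected when the statistics come from the training set only.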