min max normalization dataframe in pandas

Time:12-19

I have a dataframe df:

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
df
   A   B    C
0  1  10  100
1  2   0  200
2  5   3   50
3  3   7  500

Now I use the following command to normalize the columns of df:

df[['A', 'B', 'C']] = df[['A', 'B', 'C']].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
df
      A    B         C
0  0.00  1.0  0.111111
1  0.25  0.0  0.333333
2  1.00  0.3  0.000000
3  0.50  0.7  1.000000

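Also, I get the min and max parameters using the following command (run on the original df, before the normalization above, since afterwards every column's min is 0 and max is 1):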

min_params = dict(df[['A', 'B', 'C']].min())
max_params = dict(df[['A', 'B', 'C']].max())

I use df for the training phase. For inference, consider a new dataframe df_new like this:

df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
df_new
    A   B    C
0  10  18  250
1  15  17  300
2  20  15  150

Now I want to normalize df_new with the same procedure, using min_params and max_params from the training data. What is the best and most efficient way to do this with pandas?

CodePudding user response:

Use MinMaxScaler from scikit-learn: fit it once on the training data, then reuse the fitted scaler on new data.

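import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})

scaler = MinMaxScaler()
scaler = scaler.fit(df)  # learn each column's min and max from the training data
scaler.transform(df)     # returns a NumPy array, not a DataFrame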

Results

array([[0.        , 1.        , 0.11111111],
       [0.25      , 0.        , 0.33333333],
       [1.        , 0.3       , 0.        ],
       [0.5       , 0.7       , 1.        ]])

Now use the same fitted scaler on the new data. Values can fall outside [0, 1] because df_new exceeds the range seen during training:

df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
scaler.transform(df_new)

Results

array([[2.25      , 1.8       , 0.44444444],
       [3.5       , 1.7       , 0.55555556],
       [4.75      , 1.5       , 0.22222222]])
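Since transform returns a plain NumPy array, you may want the result back as a DataFrame. The following is a minimal sketch, assuming the same df and df_new as above; the names df_new_scaled, train_min and train_max are only illustrative. It also reads the learned statistics from the scaler's data_min_ and data_max_ attributes, which should match the min_params and max_params from the question.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})

# Fit on the training data only
scaler = MinMaxScaler().fit(df)

# transform() returns a NumPy array, so rebuild the DataFrame with the
# original column names and index
df_new_scaled = pd.DataFrame(
    scaler.transform(df_new),
    columns=df_new.columns,
    index=df_new.index,
)

# Per-column training statistics stored on the fitted scaler
train_min = pd.Series(scaler.data_min_, index=df.columns)
train_max = pd.Series(scaler.data_max_, index=df.columns)
```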

CodePudding user response:

You can also apply the min and max directly by keeping them as pd.Series (not dicts). Compute them on the original training df:

min_params = df[['A', 'B', 'C']].min()
max_params = df[['A', 'B', 'C']].max()

Then normalize df without the lambda function; pandas aligns the Series index with the DataFrame columns:

df[['A', 'B', 'C']] = (df[['A', 'B', 'C']] - min_params) / (max_params - min_params)

      A    B         C
0  0.00  1.0  0.111111
1  0.25  0.0  0.333333
2  1.00  0.3  0.000000
3  0.50  0.7  1.000000

and on df_new:

df_new[['A', 'B', 'C']] = (df_new[['A', 'B', 'C']] - min_params) / (max_params - min_params)

Output:

      A    B         C
0  2.25  1.8  0.444444
1  3.50  1.7  0.555556
2  4.75  1.5  0.222222

Of course, this is exactly the computation that MinMaxScaler performs.
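If you prefer to stay in pandas and reuse the parameters between training and inference, the steps above could be wrapped in two small helpers. This is only a sketch: fit_min_max and apply_min_max are hypothetical names, not pandas functions, and the zero-range guard and the optional clip flag are assumptions added for illustration. The helper accepts either Series or the dicts used in the question.

```python
import pandas as pd

def fit_min_max(train, columns):
    """Return the per-column min and max of the training data as Series."""
    return train[columns].min(), train[columns].max()

def apply_min_max(data, min_params, max_params, clip=False):
    """Scale the columns named in min_params using stored training statistics."""
    min_s = pd.Series(min_params, dtype=float)
    max_s = pd.Series(max_params, dtype=float)
    # Assumption: a constant training column (max == min) is divided by 1
    # instead of 0, so its training values map to 0
    span = (max_s - min_s).replace(0, 1.0)
    scaled = (data[list(min_s.index)] - min_s) / span
    if clip:
        # Optionally force unseen values back into [0, 1]
        scaled = scaled.clip(lower=0.0, upper=1.0)
    return scaled

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
cols = ['A', 'B', 'C']

# Compute the parameters on the raw training data, before scaling it
min_params, max_params = fit_min_max(df, cols)

df_scaled = apply_min_max(df, min_params, max_params)
# Dicts work as well, as in the question
df_new_scaled = apply_min_max(df_new, dict(min_params), dict(max_params))
df_new_clipped = apply_min_max(df_new, min_params, max_params, clip=True)
```

Computing the parameters once and passing them around keeps the inference step independent of the training dataframe.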
