Should we apply normalization on whole data set or only X

Time:05-23

I am working on a machine learning project in Python and trying several models on my data. I am confused about the following, for both classification and regression:

  1. Should I apply normalization (z-score or standard deviation scaling) to the whole data set first, and then separate the features (X) and the output (y)?
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

def normalize(df):
    # Scale each column by its maximum absolute value
    scaler = MaxAbsScaler()
    scaled = scaler.fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)

data = normalize(data)  # scales every column, including the target 'col'
X = data.drop(columns=['col'])
y = data['col']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  2. Or should I apply it only to the features (X)?
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

X = data.drop(columns=['col'])
y = data['col']

def normalize(df):
    # Scale each column by its maximum absolute value
    scaler = MaxAbsScaler()
    scaled = scaler.fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)

X = normalize(X)  # only the features are scaled; y is left unchanged

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

CodePudding user response:

TL;DR: normalize the input features (X), but not the output (y).

Whether to normalize depends on the algorithm, and the normalization itself is applied per feature.

Some algorithms do not require any normalization (like decision trees).
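
As a rough check of the decision tree claim, the sketch below fits the same tree on raw and on scaled inputs and compares the predictions. The data is synthetic and made up for illustration, and `MaxAbsScaler` is used only because the question uses it; this is a small experiment, not a proof.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): two features on different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 10, 200), rng.uniform(1e5, 1e6, 200)])
y = (X[:, 0] + X[:, 1] / 1e5 > 10).astype(int)

X_train, X_new = X[:150], X[150:]
y_train = y[:150]

# Scaling is learned from the training inputs and reused for the new inputs
scaler = MaxAbsScaler().fit(X_train)

# Same tree settings, once on raw inputs and once on scaled inputs
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(scaler.transform(X_new))

# Tree splits depend on the ordering of values, which positive scaling preserves
same_predictions = bool(np.array_equal(pred_raw, pred_scaled))
print(same_predictions)
```

Distance-based or gradient-based models would typically not behave this way, which is why the need for scaling depends on the algorithm.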

Applying normalization on the dataset: when the dataset has more than one feature, normalize each feature separately, using all the examples of that feature in the dataset.

For example, say you have two features, X and Y. Feature X is always a decimal in the range [0, 10], while Y is in the range [100K, 1M]. If you normalize X and Y separately, and then normalize them together with a single scale, you will see that in the combined case the values of feature X become insignificant.
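
The effect described in this example can be sketched with a few made-up rows. The numbers below are assumptions chosen to match the ranges above, and max-abs scaling is used only because the question uses `MaxAbsScaler`.

```python
import numpy as np

# Column 0 plays the role of feature X, column 1 the role of feature Y
# (values are made up for illustration)
data = np.array([
    [2.0, 100_000.0],
    [5.0, 500_000.0],
    [10.0, 1_000_000.0],
])

# Per-feature scaling: divide each column by its own maximum absolute value
per_feature = data / np.abs(data).max(axis=0)

# Combined scaling: divide every value by one global maximum
combined = data / np.abs(data).max()

print(per_feature[:, 0])  # feature X still spans a usable range
print(combined[:, 0])     # feature X is squashed towards zero
```

With per-feature scaling both columns end up in a comparable range, while with a single global scale the first column is several orders of magnitude smaller than the second.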

For Output (labels):

Generally, there is no need to normalize the output or labels for regression or classification tasks. However, make sure the same normalization is applied to the input data both at training time and at inference time.
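
One common way to keep training-time and inference-time scaling consistent is to fit the scaler once on the training inputs and reuse it for any later inputs. The sketch below assumes this reading of the advice; the data is synthetic and the variable names follow the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Synthetic inputs and labels (made up for illustration)
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0, 10, 100), rng.uniform(1e5, 1e6, 100)])
y = rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the scale from training inputs
X_test_scaled = scaler.transform(X_test)        # reuse the same scale at inference time

# y_train and y_test are left as they are: the output is not normalized
```

Wrapping the scaler and the model in a `sklearn.pipeline.Pipeline` would be another way to get the same behavior without handling the scaler by hand.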

If the task is classification, the common approach is simply to encode the classes as numbers (for example, with the classes dog and cat, you assign 0 to one and 1 to the other).
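
A minimal sketch of such an encoding, assuming scikit-learn's `LabelEncoder` is an acceptable tool here (the answer does not name one). The label list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Example class labels (made up for illustration)
labels = ["dog", "cat", "cat", "dog"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder sorts the classes, so here cat -> 0 and dog -> 1
print(list(encoder.classes_))
print(list(encoded))

# The original class names can be recovered from the integer codes
decoded = encoder.inverse_transform(encoded)
```

Which class gets 0 and which gets 1 does not matter for training, as long as the same mapping is used when interpreting predictions.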
