Normalization Formula - Using Whole/Train Dataset To Calculate Mean and STD Within Formula?

The formula to standardize a dataset is as below:

Standardized_X = (X - Mean) / (Standard Deviation)

I was wondering whether I am supposed to compute the mean and std on the whole dataset (the concatenation of train and test) or only on the train dataset. I read somewhere that the mean and STD of the train dataset should be used in the normalization formula for both the train and test datasets, but it doesn't make sense to me.

Assume the data is in variables X_tr, Y_tr, X_te, Y_te with shapes [n,d], [n,1], [m,d], [m,1] respectively.

My answer is as below:

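import numpy

# concatenate takes a tuple of arrays; a second positional argument would be read as the axis
X_mean = numpy.mean(numpy.concatenate((X_tr, X_te)))
X_std = numpy.std(numpy.concatenate((X_tr, X_te)))
X_tr_norm = (X_tr - X_mean)/X_std
X_te_norm = (X_te - X_mean)/X_std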

CodePudding user response:

"mean and STD of train dataset should be used in normalization formula for both train and test dataset, but it doesn't make sense to me"

It makes sense to me. First, the train dataset should be large enough to give a good estimate of the mean and std. Second, this way you do not "cheat" by letting information from the test data leak into training.
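A minimal sketch of the train-only approach, assuming NumPy arrays shaped like X_tr and X_te in the question. The random toy data and the names train_mean and train_std are made up for illustration, and taking the statistics per feature (axis=0) is an assumption about what is wanted:

```python
import numpy as np

# Hypothetical toy data standing in for X_tr ([n, d]) and X_te ([m, d]).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics come from the training set only, one value per feature (column).
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)

# The same training statistics are applied to both splits.
X_tr_norm = (X_tr - train_mean) / train_std
X_te_norm = (X_te - train_mean) / train_std
```

With this setup the normalized train set has zero mean and unit std per feature, while the normalized test set will usually be only close to that, which is expected. The sketch also assumes no feature is constant in the train set, since that would give a zero std.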

If you really want to calculate the mean and std over both train and test data, you do not have to concatenate them. You can compute the weighted mean and std yourself.

import numpy as np

# weighted mean: each split contributes in proportion to its number of rows
combined_mean = (len(train_data) * train_data.mean() + len(test_data) * test_data.mean()) \
                / (len(train_data) + len(test_data))
# std is more tricky: use each split's mean squared deviation from the combined mean
train_second_moment = np.mean((train_data - combined_mean) ** 2)
test_second_moment = np.mean((test_data - combined_mean) ** 2)
combined_std = np.sqrt((len(train_data) * train_second_moment + len(test_data) * test_second_moment) \
               / (len(train_data) + len(test_data)))
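
As a sanity check, here is a per-feature variant of the same weighting idea, compared against plain concatenation. It is only a sketch: the toy arrays, the axis=0 choice, and the names n_train, n_test and stacked are assumptions for illustration, and it relies on NumPy's default population std (ddof=0):

```python
import numpy as np

# Hypothetical toy arrays; any two arrays with the same number of columns work.
rng = np.random.default_rng(1)
train_data = rng.normal(size=(50, 4))
test_data = rng.normal(loc=1.0, size=(30, 4))

n_train, n_test = len(train_data), len(test_data)

# Per-feature weighted mean, weighting each split by its number of rows.
combined_mean = (
    n_train * train_data.mean(axis=0) + n_test * test_data.mean(axis=0)
) / (n_train + n_test)

# Per-feature mean squared deviation from the combined mean, pooled the same way.
train_second_moment = np.mean((train_data - combined_mean) ** 2, axis=0)
test_second_moment = np.mean((test_data - combined_mean) ** 2, axis=0)
combined_std = np.sqrt(
    (n_train * train_second_moment + n_test * test_second_moment)
    / (n_train + n_test)
)

# Reference values from explicit concatenation along the row axis.
stacked = np.concatenate((train_data, test_data), axis=0)
print(np.allclose(combined_mean, stacked.mean(axis=0)))
print(np.allclose(combined_std, stacked.std(axis=0)))
```

Both comparisons should agree up to floating-point error, so the weighted form is mainly useful when the two splits cannot conveniently be held in memory together.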