Increase precision when standardizing test dataset


I am dealing with a dataset in R divided into train and test. I preprocess the data by centering it and dividing by the standard deviation, so I want to store the mean and sd values of the training set and scale the test set using those same values. However, the precision I get when I use the scale function is much better than when I use the colMeans and apply(x, 2, sd) functions.

set.seed(5)
a = matrix(rnorm(30000,  mean=10, sd=5), 10000, 3)  # Generate data

a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function

Now, if I compare the means of both matrices:

colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16  1.331857e-16

colMeans(a_scale_custom)
[1]  0.007461065 -0.004395052 -0.003046839

The matrix obtained using scale has column means of essentially 0, while the matrix obtained by subtracting the mean using colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.

Is there any way I can obtain better precision when scaling the data without using the scale function?

CodePudding user response:

This is not a precision issue; the custom code has a bug in the matrix layout. In a - colMeans(a), R recycles the length-3 vector of column means down the rows of the 10000 x 3 matrix (column-major order), so most elements have the wrong column's mean subtracted. You need to transpose the matrix with t() before subtracting the vector, then transpose it back. Try the following:

set.seed(5)
a <- matrix(rnorm(30000,  mean=10, sd=5), 10000, 3)  # Generate data

a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))

colMeans(a_scale)
colMeans(a_scale_custom)
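
Since the original goal was to reuse the training statistics on a test set, note that scale() also accepts numeric vectors for its center and scale arguments, so you can apply the training-set values to the test set directly. A rough sketch, assuming train and test are your own split of the data (the names here are just placeholders):

train_center <- colMeans(train)        # column means of the training set
train_scale  <- apply(train, 2, sd)    # column standard deviations of the training set

train_std <- scale(train, center = train_center, scale = train_scale)
test_std  <- scale(test,  center = train_center, scale = train_scale)  # same statistics as the training set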

see also: How to divide each row of a matrix by elements of a vector in R
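
For completeness, sweep() applies a vector across the columns explicitly and avoids the double transpose. A minimal sketch using the same a as above (train_center and train_scale are just names chosen here for the stored statistics):

train_center <- colMeans(a)
train_scale  <- apply(a, 2, sd)

a_centered    <- sweep(a, 2, train_center, "-")          # subtract each column's mean
a_scale_sweep <- sweep(a_centered, 2, train_scale, "/")  # divide by each column's sd

colMeans(a_scale_sweep)  # essentially zero, matching scale(a)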
