I am dealing with a dataset in R divided into train and test. I preprocess the data by centering and dividing by the standard deviation, so I want to store the mean and sd values of the training set and use those same values to scale the test set. However, the precision obtained with the scale function is much better than when I use the colMeans and apply(x, 2, sd) functions.
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now, if I compare the column means of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained with scale has column means that are essentially zero (on the order of 10^-16), while the matrix obtained by subtracting the means with colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?
CodePudding user response:
The custom computation has a bug in the matrix layout: when you subtract the length-3 vector colMeans(a) from the 10000 x 3 matrix a, R recycles the vector down the columns (column-major order), so most elements have the wrong column's mean subtracted. You need to transpose the matrix with t() before subtracting the vector, then transpose it back. Try the following:
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd)) # transpose so the stats line up with columns, then transpose back
colMeans(a_scale)
colMeans(a_scale_custom)
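If you'd rather avoid the double transpose, base R's sweep() applies a vector of statistics along a chosen margin; a minimal sketch doing the same column-wise centering and scaling (the a_scale_sweep name is just for illustration):
a_scale_sweep <- sweep(a, 2, colMeans(a), "-")                  # subtract each column's mean
a_scale_sweep <- sweep(a_scale_sweep, 2, apply(a, 2, sd), "/")  # divide by each column's sd
colMeans(a_scale_sweep)  # essentially zero, matching scale()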
see also: How to divide each row of a matrix by elements of a vector in R
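As for the original goal of reusing the training statistics on the test set: scale() stores the values it used as attributes on its result, so you can pass them back in for the test data. A sketch assuming a made-up train/test split (train, test, mu, and sds are illustrative names):
set.seed(5)
x <- matrix(rnorm(30000, mean = 10, sd = 5), 10000, 3)
train <- x[1:8000, ]      # illustrative split
test  <- x[8001:10000, ]

train_scaled <- scale(train)                          # stores the values it used as attributes
mu  <- attr(train_scaled, "scaled:center")            # training column means
sds <- attr(train_scaled, "scaled:scale")             # training column sds
test_scaled <- scale(test, center = mu, scale = sds)  # reuse the training statistics on the test set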