I have a matrix $X=(X_{ij})\in R^{n\times d}$ and a vector $y=(y_i)\in R^{n}$ and I want to centralize $y$ and centralize and normalize $X$ in columns , i.e., $\sum_{i=1}^{n} y_i=0$, $\sum_{i=1}^{n} X_{ij}=0$ 和 $\sum_{i=1}^{n} X_{ij}^2=1$, $j=1,2,\ldots,d$.
I have implemented it in R successfully, with a toy example as follows (PS: My toy example is very simple that can be manually computed and checked! )
### wirte the function
normalize = function(X,y){
n = dim(X)[1]; d = dim(X)[2]
D = matrix(0,d,1)
y = y - mean(y) #center y
for (k in 1:d){
Xk = X[,k]
Xk = Xk - mean(Xk) #column-center X
ck = 1/norm(Xk,"2")
X[,k] = Xk * ck #column-normalization X
D[k,1] = ck
}
out = list(X=X,D=D,y=y)
return(out)
}
### test the function
X = matrix(1:4,2,byrow=T)
y = matrix(1:2)
out = normalize(X,y)
( ny = out$y ) #show centered y
mean(ny) #check y center
( nX = out$X ) #show centered and normalized X (each column)
apply(nX,2,mean) #check X column-center
apply(nX,2,function(x)norm(x,"2")) #check X column-normalization
Then I want to reproduce the R procedure in Python as strictly as possible and try it as
### wirte the function
import numpy as np
def normalize(X,y):
n,d = X.shape
D = np.zeros((d,1))
y = y-np.mean(y) #center y
for k in range(d):
Xk = X[:,k]
Xk = Xk - np.mean(Xk) #column-center X
ck = 1/np.linalg.norm(Xk)
X[:,k] = Xk * ck #column-normalization X
D[k,:] = ck
return X,D,y
### test the function
X = np.arange(1,5,1).reshape(2,2)
y = np.array([1,2]).reshape(-1,1)
nX,D,ny = normalize(X,y)
ny #show centered y
np.mean(ny) #check y center
nX #show centered and normalized X (each column)
np.mean(nX,axis=0) #check X column-center
np.linalg.norm(nX, axis=0) #check X column-normalization
The results of ny
and D
in Python matches that in R. However, the nX
output in Python gives obviously the wrong result, which shows the zero matrix! Did I missing something?
CodePudding user response:
This happens because when you are creating the X
matrix with np.arange
, numpy infers the dtype of the matrix to be int
. Hence, when you are setting the values with 1/sqrt(2)
they get truncated to 0
.
Declare X
as follows:
X = np.arange(1, 5, 1, dtype=np.float64).reshape(2, 2)
Resulting X
:
array([[-0.70710678, -0.70710678],
[ 0.70710678, 0.70710678]])
As a side note, you can create a function that returns the same result and does not use a loop by using vectorised numpy functions:
def normalize(X, y):
Xk = X - np.mean(X, axis=0)
D = 1/np.linalg.norm(Xk, axis=0)
return Xk * D, D.reshape(-1, 1), y - np.mean(y)
Anothet difference is in side-effects. This function does not modify the original X
matrix, while yours does.