Python matrix column cyclic assignment-CodePudding

I have a matrix $X=(X_{ij})\in R^{n\times d}$ and a vector $y=(y_i)\in R^{n}$ and I want to centralize $y$ and centralize and normalize $X$ in columns , i.e., $\sum_{i=1}^{n} y_i=0$, $\sum_{i=1}^{n} X_{ij}=0$ 和 $\sum_{i=1}^{n} X_{ij}^2=1$, $j=1,2,\ldots,d$.

I have implemented it in R successfully, with a toy example as follows (PS: My toy example is very simple that can be manually computed and checked! )

### wirte the function
normalize = function(X,y){
n = dim(X)[1]; d = dim(X)[2] 
D = matrix(0,d,1)
y = y - mean(y) #center y
for (k in 1:d){
    Xk = X[,k]
    Xk = Xk - mean(Xk) #column-center X
    ck = 1/norm(Xk,"2")
    X[,k] = Xk * ck #column-normalization X
    D[k,1] = ck
}   
out = list(X=X,D=D,y=y)
return(out)
}

### test the function
X = matrix(1:4,2,byrow=T)
y = matrix(1:2) 
out = normalize(X,y)
( ny = out$y ) #show centered y
mean(ny) #check y center
( nX = out$X ) #show centered and normalized X (each column)
apply(nX,2,mean) #check X column-center
apply(nX,2,function(x)norm(x,"2")) #check X column-normalization

Then I want to reproduce the R procedure in Python as strictly as possible and try it as

### wirte the function
import numpy as np
def normalize(X,y):
  n,d = X.shape 
  D = np.zeros((d,1))
  y = y-np.mean(y) #center y
  for k in range(d): 
    Xk = X[:,k]
    Xk = Xk - np.mean(Xk) #column-center X
    ck = 1/np.linalg.norm(Xk)
    X[:,k] = Xk * ck #column-normalization X
    D[k,:] = ck
  
  return X,D,y
 
### test the function
X = np.arange(1,5,1).reshape(2,2)
y = np.array([1,2]).reshape(-1,1)
nX,D,ny = normalize(X,y)
ny #show centered y
np.mean(ny) #check y center
nX #show centered and normalized X (each column)
np.mean(nX,axis=0) #check X column-center
np.linalg.norm(nX, axis=0) #check X column-normalization

The results of ny and D in Python matches that in R. However, the nX output in Python gives obviously the wrong result, which shows the zero matrix! Did I missing something?

CodePudding user response：

This happens because when you are creating the X matrix with np.arange, numpy infers the dtype of the matrix to be int. Hence, when you are setting the values with 1/sqrt(2) they get truncated to 0.

Declare X as follows:

X = np.arange(1, 5, 1, dtype=np.float64).reshape(2, 2)

Resulting X:

array([[-0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]])

As a side note, you can create a function that returns the same result and does not use a loop by using vectorised numpy functions:

def normalize(X, y):
    Xk = X - np.mean(X, axis=0)
    D = 1/np.linalg.norm(Xk, axis=0)
    return Xk * D, D.reshape(-1, 1), y - np.mean(y)

Anothet difference is in side-effects. This function does not modify the original X matrix, while yours does.