Find OLS solution using optim


I am trying to find the solution of a linear model using optim. I could use lm, or even matrix multiplication, to get the explicit solution, but the end goal is to use different objective functions. Since I can't replicate the lm solution, there must be something wrong with my code, and I cannot figure out what. My linear model does not have an intercept term.

Here is an example

RSS_f <- function(x){
  ## fitted values: predictor columns (2 onwards) times the coefficient vector
  f_hat <- as.matrix(dat[, 2:ncol(dat)]) %*% as.matrix(x)
  ## residuals against the response in column 1
  error <- dat[, 1] - f_hat
  return(sum(error^2))
}

x0 <- rep(0, 6)

r <- optim(x0, RSS_f)

Here is the sample data I have been using

dat=structure(list(y = c(-0.015592301115793, -0.257703905025368, 
0.132944891819098, 0.144585602249198, -0.0834638586276937, -0.132506252754974, 
-0.0671452738850035, 0.017093324992248), x1 = c(0.0762815118005244, 
0.0684222303897641, 0.0126545677701762, -0.0283951453807663, 
0.0682085328963522, 0.030322515917991, -0.0288021662870674, 0.0213481986485853
), x2 = c(0.121096462766523, 0.173507656658378, 0.0830579904733009, 
-0.0131531465094037, 0.14514027572694, -0.00235172620017821, 
-0.116782685211159, 0.128253294500769), x3 = c(0.00271738492481077, 
0.199849189671841, -0.149623094678353, -0.0375499460631867, 0.035188928292329, 
0.0421848373709954, -0.212940391685557, -0.280765720194465), 
    x4 = c(-0.661111149054194, -1.62711136938006, -0.16384009438204, 
    1.24217835189415, -1.01181978837279, -0.968527314213574, 
    -0.271489340669151, -0.602168687364268), x5 = c(1.6840335520254, 
    0.0085051611667053, 0.803743907332177, -1.86248825400901, 
    1.67581124074128, -0.0919360291035565, -0.210122962144954, 
    1.96152949931911), x6 = c(0.251747455313378, -0.353511278663921, 
    -0.155085549001921, -0.376159415130184, -0.0614334625077317, 
    0.759475827597367, 0.10074325959879, 0.512321974431873)), row.names = 63:56, class = "data.frame")
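
For reference, the explicit (no-intercept) least-squares solution mentioned above can be computed directly from the normal equations; a minimal sketch using this dat:

X <- as.matrix(dat[, -1])   # predictors only, no intercept column
y <- dat[, 1]
## normal-equations solution; should essentially match coef(lm(y ~ . - 1, data = dat))
solve(crossprod(X), crossprod(X, y))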

Answer 1:

You forgot the intercept. lm fits the model

y = a + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + error

while you are fitting

y = b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + error
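
For reference, the two models in lm's formula notation (a quick illustration; "- 1" drops the intercept):

fit_with    <- lm(y ~ ., data = dat)       # intercept plus x1..x6
fit_without <- lm(y ~ . - 1, data = dat)   # no intercept, matching the original RSS_f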

Redefine your function as:

RSS_f <- function(x, dat) {
  ## x[1] is the intercept; the remaining elements are the slopes
  f_hat <- x[1] + as.matrix(dat[, -1]) %*% x[-1]
  error <- dat[, 1] - f_hat
  sum(error^2)
}

You also need to choose the initial value for optim wisely.

b <- unname(lm(y ~ ., data = dat)$coefficients)
#[1]  0.07462097 -2.48751498  0.37746968  0.40302696  0.18224149  0.09565363
#[7]  0.02259334

## in a neighborhood of true values
init <- rnorm(length(b), mean = b, sd = 0.01)

r <- optim(init, RSS_f, dat = dat)
r$par
#[1]  0.07248056 -2.39952127  0.37383708  0.38479202  0.18192510  0.09347675
#[7]  0.02204713

I was slow in finishing my answer...

As Prof. Bolker mentions, optim struggles to converge, so we see a discrepancy in the computed results (r$convergence is 1, not 0), even if init is close enough to b. I was about to say that you have too little data: 8 observations versus 7 parameters (or 6 without the intercept) is just hard for approximation-based numerical optimization.

While Bolker suggests that you can tune optim by increasing maxit, the default simplex method (i.e., Nelder-Mead) is just too slow. What I often suggest is to use method = "BFGS" and provide the exact gradient via the gr argument whenever we can. This dramatically increases accuracy, convergence rate and computation speed.

RSS_g <- function (x, dat) {
  y <- dat[, 1]
  X <- cbind(1, as.matrix(dat[, -1]))
  y_hat <- X %*% x
  error <- y - y_hat
  ## gradient of sum((y - X %*% x)^2) with respect to x
  -2 * crossprod(X, error)
}

optim(init, fn = RSS_f, gr = RSS_g, method = "BFGS", dat = dat)
#$par
#[1]  0.07469957 -2.49096695  0.37763926  0.40371461  0.18224659  0.09573302
#[7]  0.02261385
#
#$value
#[1] 0.008676038
#
#$counts
#function gradient 
#      24       16   ## 16 steps till convergence
#
#$convergence
#[1] 0
#
#$message
#NULL

For the poor initial value numeric(7), it takes more steps to converge, but maxit = 400 is sufficient.

optim(numeric(7), fn = RSS_f, gr = RSS_g, method = "BFGS", dat = dat,
      control = list(maxit = 400))
#$par
#[1]  0.07113841 -2.33775916  0.36920607  0.37293500  0.18186110  0.09220104
#[7]  0.02139395
#
#$value
#[1] 0.008678216
#
#$counts
#function gradient   ## 349 steps till convergence
#     477      349 
#
#$convergence
#[1] 0
#
#$message
#NULL

Full Newton's method

Even better is to provide the exact Hessian matrix and use nlm instead of optim.

RSS_h <- function (x, dat) {
  X <- cbind(1, as.matrix(dat[, -1]))
  ## Hessian of the RSS: 2 * t(X) %*% X (constant in x)
  2 * crossprod(X)
}

## nlm() looks for "gradient" and "hessian" attributes on the *value* returned by
## the objective function, so bundle the three pieces into a single function
RSS_fgh <- function (x, dat) {
  structure(RSS_f(x, dat), gradient = RSS_g(x, dat), hessian = RSS_h(x, dat))
}

nlm(RSS_fgh, numeric(7), dat = dat)
#$minimum
#[1] 0.008676037
#
#$estimate
#[1]  0.07462042 -2.48751019  0.37747413  0.40302288  0.18224109  0.09565303
#[7]  0.02259344
#
#$gradient
#[1] -8.815716e-08 -3.457165e-08  6.583843e-08 -8.195144e-08 -3.034804e-07
#[6] -7.572794e-07 -1.521478e-07
#
#$code  ## 1 or 2 means convergence
#[1] 1
#
#$iterations
#[1] 44

Optimization is an art. There is a lot to consider, but generally speaking, the more useful information you provide, the easier it is for the computer to do the dirty work.

Answer 2:

I see one obvious problem here, and encountered several non-obvious problems as I moved forward.

  1. you forgot to include an intercept in your model matrix (your edit makes clear that this is not the problem — it would have saved time to mention this in the first place!)
  2. when fitting this model with the default (Nelder-Mead) algorithm from these starting points, it takes more than the default number of iterations (500) to converge.
  3. your example set of data/predictors is close to multicollinear, which makes naive/"brute force" optimization approaches unstable

A slightly reorganized objective function (the only important difference is cbind(1, ...), which adds an intercept column to the model matrix: point (1) above):

RSS_f <- function(beta, dat){
    ## inefficient, should compute these once outside the objective, but ...
    X <- cbind(1, as.matrix(dat[,-1]))
    y <- dat[,1]
    f_hat <- X %*% beta
    return(sum((y - f_hat)^2))
}
x0 <- rep(0,7)  ## need one additional element in the vector for the intercept
r <- optim(x0, RSS_f, dat = dat, control = list(maxit = 1e5))

Points 2 and 3 are related to @ZheyuanLi's answer (i.e., picking a good starting point will mitigate them).

Point (2): if we run this optim() without control = list(maxit = 1e5), it fails to converge, but it does so silently. You have to look at r$convergence, see that the value is 1 and not 0 (zero means "converged"), and then go look at ?optim to find that a value of 1 means the iteration limit maxit was reached.
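
A minimal illustration of that check (r_default is just an illustrative name for the run without the larger maxit):

r_default <- optim(x0, RSS_f, dat = dat)   # default maxit for Nelder-Mead is 500
r_default$convergence                      # 1 = iteration limit reached; 0 would mean success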

Point (3): if you look at the coefficients, you'll see that they still don't quite match lm. This data set has 8 observations and 7 coefficients to estimate (6 predictors plus an intercept), which makes the OLS fit pretty ambitious (although not impossible). I tried a few other options (e.g. method = "BFGS" in optim()), which didn't help much.

You can get an idea of the stability of the OLS problem by looking at the ratio of the largest to the smallest eigenvalue of crossprod(X) (i.e., t(X) %*% X); in this case it's a factor of roughly 2e5 (bad news).

X <- cbind(1, as.matrix(dat[,-1]))
eigen(crossprod(X))$values
[1] 2.034880e+01 6.082053e+00 2.730513e+00 9.748323e-01 8.148341e-02
[6] 1.358231e-02 9.186567e-05
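
To make that ratio explicit (a quick sketch using the X defined above):

ev <- eigen(crossprod(X))$values
max(ev) / min(ev)   # roughly 2.2e5 for this X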

Alternatively, compute the condition number: Matrix::condest(crossprod(X))$est is about 2.75e5 (again, a large value is bad).

For comparison, if I generate predictors randomly (m <- matrix(rnorm(48), nrow = 8)), add an intercept column, and compute the condition number of crossprod(m), the values are typically on the order of 500 (1000 random replicates ranged from about 20 to 30,000). So part of the problem is 'not much data for a model with this many predictors', but your example is even worse than a randomly sampled one.
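
A sketch of that comparison (the exact replication code isn't in the original answer; cond_rand is just an illustrative name):

set.seed(1)   # any seed; the quoted range will vary slightly
cond_rand <- replicate(1000, {
  m <- matrix(rnorm(48), nrow = 8)                # 8 observations, 6 random predictors
  Matrix::condest(crossprod(cbind(1, m)))$est     # condition number with an intercept column
})
summary(cond_rand)   # typically a few hundred; roughly 20 to 30,000 in the answer's replicates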

R's built-in method for fitting linear models (QR with a custom pivoting strategy) works robustly even on such ill-posed problems.
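
For comparison, a sketch of that QR route using base R (not code from the answer; lm uses its own pivoting QR internally):

X <- cbind(1, as.matrix(dat[, -1]))
y <- dat[, 1]
qr.coef(qr(X), y)   # stable least-squares solve; essentially reproduces coef(lm(y ~ ., data = dat))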

I don't think removing the intercept will change the ill-conditioning problem very much, although I haven't checked.
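
If you want to check, a minimal sketch (X0 is just an illustrative name for the no-intercept design):

X0  <- as.matrix(dat[, -1])          # drop the intercept column
ev0 <- eigen(crossprod(X0))$values
max(ev0) / min(ev0)                  # compare with the ratio for the intercept model above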
