Home > Enterprise >  How do I use the means of the columns of a matrix as prediction values in a linear regression in R?
How do I use the means of the columns of a matrix as prediction values in a linear regression in R?

Time:05-24

Problem Statement: Some near infrared spectra on 60 samples of gasoline and corresponding octane numbers can be found by data(gasoline, package="pls"). Compute the mean value for each frequency and predict the response for the best model using the five different methods from Question 4.

Note: This is Exercise 11.5 in Linear Models with R, 2nd Ed., by Julian Faraway. Also, the "five different methods from Question 4" are: linear regression with all predictors, linear regression with variables selected using AIC, principal component regression, partial least squares, and ridge regression.

My Work So Far: We do

require(pls)
data(gasoline, package="pls")
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
lmod = lm(octane~NIR,train_gas)

So far, so good. However, if I look at a summary of the model, I find that 348 coefficients are not defined because of singularities. (Why?) Moreover, massaging the mean values of the columns of the NIR matrix (the predictors) into an acceptable data frame is proving difficult.

My Question: How can I get to the point where the highly-fussy predict function will let me do something like this:

new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)

?

Incidentally, as I have done a significant amount of moderating on Stats.SE, I can assert positively that this question would be closed on Stats.SE as being off-topic. It's a "programming or data request", and hence unwelcome on Stats.SE.

I have also looked up a few related questions on SO, but nothing seems to fit exactly.

CodePudding user response:

This does seem pretty CrossValidated-ish to me ... gasoline is a rather odd object, containing a 'column' (element) that is a 401-column matrix:

data.frame':    60 obs. of  2 variables:
 $ octane: num  85.3 85.2 88.5 83.4 87.9 ...
 $ NIR   : 'AsIs' num [1:60, 1:401] -0.0502 -0.0442 -0.0469 -0.0467 -0.0509 ...

However, the fundamental problem is that this is a p>>n problem; there are 60 observations and 401 predictors. Thus, a standard linear regression probably just won't make sense - you probably want to use a penalized approach like LASSO/ridge (i.e., glmnet). This is why you get the undefined coefficients (without some kind of penalization, you can't estimate 402 coefficients (ncols 1 for the intercept) from 60 observations ...)

However, if we do want to hack this into a shape where we can do the linear model and prediction (however ill-advised):

NIR <- gasoline$NIR
class(NIR) <- "matrix" ## override "AsIs" class
g2 <- data.frame(octane = gasoline$octane, NIR)
dim(g2) ## 60 402 - now this is a 'regular' data frame
## using train_index from above
train_gas <- g2[train_index,]
lmod = lm(octane~., train_gas)
## drop first column (response); use `lapply()` to maintain list structure
new_data <- as.data.frame(lapply(train_gas[-1], mean))
predict(lmod, new_data)
##        1 
## 87.16019 
## Warning message:
## In predict.lm(lmod, new_data) :
##   prediction from a rank-deficient fit may be misleading

A slightly more direct approach (but still ugly) is to fit the model to the original weird structure and construct a prediction frame that matches that weird structure, i.e.

pp <- data.frame(NIR=I(matrix(colMeans(train_gas$NIR), nrow = 1)))

If you were willing to forgo predict() you could do it like this:

sum(na.omit(coef(lmod) * c(1, colMeans(train_gas$NIR))))
  • Related