Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps
package in R has a useful function for model selection called regsubsets
which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which
command produces the output of the regsubsets
function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes the AIC. But here's the problem: while the typed-up example works fine, I'm having trouble with the wrong number of elements in the array when I try to use this code and adapt it to other data. For example:
require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp gas pack
I(temp^2) I(gas^2) I(pack^2)
I(temp*gas) I(temp*pack) I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8
, but length(2:10)=9
. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss
in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary()
doing to the output of the regsubsets()
function? The number of RSS values is not only not the same, but the values themselves are not the same.
CodePudding user response:
regsubsets
has a nvmax
parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.
Please note though, that the 50
in your AIC formula is the sample size (i.e. 50 states in statedata
). So for your second example, this should be nrow(odor)
, so 15
.
CodePudding user response:
Ok, so you know the help page for regsubsets
says
regsubsets
returns an object of class"regsubsets"
containing no user-serviceable parts. It is designed to be processed bysummary.regsubsets
.
You're about to find out why.
The code in regsubsets
calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps
(and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets
mostly followed the structure of the Fortran driver program that came with the code.
The rss
field of the regsubsets
object has that name because it stores a variable called RSS
in the Fortran code. This variable is not the residual sum of squares of the best model. RSS
is computed in the setup phase, before any subset selection is done, by the subroute SSLEAPS
, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS
describes the RSS of the models with no selection fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran so it's not documented.
The code in summary.regsubsets
extracts the residual sum of squares in the output from the $ress
component of the object, which comes from the RESS
variable in the Fortran code. This is an array whose [i,j]
element is the residual sum of squares of the j
-th best model of size i
.
All the model criteria are computed from $ress
in the same loop of summary.regsubsets
, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
for (j in 1:nshow) {
vr <- ll$ress[i, j]/ll$nullrss
rssvec <- c(rssvec, ll$ress[i, j])
rsqvec <- c(rsqvec, 1 - vr)
adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1 ll$intercept -
i))
cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1 ll$intercept -
2 * i))
bicvec <- c(bicvec, (n1 ll$intercept) * log(vr)
i * log(n1 ll$intercept))
}
}
cpvec
gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.