What does the summary function do to the output of regsubsets?-CodePudding

Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.

The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:

data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50)   (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")

The rs$which command produces the output of the regsubsets function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes the AIC. But here's the problem: while the typed-up example works fine, I'm having trouble with the wrong number of elements in the array when I try to use this code and adapt it to other data. For example:

require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp gas pack 
              I(temp^2) I(gas^2) I(pack^2) 
              I(temp*gas) I(temp*pack) I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50)   (2:10)*2

produces a warning message:

Warning message:
In 50 * log(rs$rss/50)   (2:10) * 2 :
  longer object length is not a multiple of shorter object length

Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!

So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.

CodePudding user response：

regsubsets has a nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.

Please note though, that the 50 in your AIC formula is the sample size (i.e. 50 states in statedata). So for your second example, this should be nrow(odor), so 15.

CodePudding user response：

Ok, so you know the help page for regsubsets says

regsubsets returns an object of class "regsubsets" containing no user-serviceable parts. It is designed to be processed by summary.regsubsets.

You're about to find out why.

The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.

The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroute SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran so it's not documented.

The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.

All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:

    for (i in ll$first:min(ll$last, ll$nvmax)) {
        for (j in 1:nshow) {
            vr <- ll$ress[i, j]/ll$nullrss
            rssvec <- c(rssvec, ll$ress[i, j])
            rsqvec <- c(rsqvec, 1 - vr)
            adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1   ll$intercept - 
                i))
            cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1   ll$intercept - 
                2 * i))
            bicvec <- c(bicvec, (n1   ll$intercept) * log(vr)   
                i * log(n1   ll$intercept))
        }
    }

cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.