inconsistent R-squared values in lm()-CodePudding

I am fitting the same linear model in two different ways, resulting in the same parameter estimates but differing R-squared values. Where does the difference come from? Is this a bug in R? Here is my code:

m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)

X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ 0   X)
summary(m2)

The output for m1 is the following (slightly shortened):

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

The output for m2 is has the same estimates for coefficients and residual standard error, but different R-squared values and different F-statistic:

             Estimate Std. Error t value Pr(>|t|)    
X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.979, Adjusted R-squared:  0.9741 
F-statistic: 198.2 on 4 and 17 DF,  p-value: 5.098e-14

Why are the R-squared values different?

CodePudding user response：

This is discussed in this post and also this. Here's a break down of what is happening in the lm() source code. The relevant part:

r <- z$residuals
f <- z$fitted.values
w <- z$weights
if (is.null(w)) {
        mss <- if (attr(z$terms, "intercept"))
            sum((f - mean(f))^2) else sum(f^2)
        rss <- sum(r^2)
}

Although you included an intercept, the attributes of the terms are not set to include an intercept, compare:

attr(m1$terms,"intercept")
[1] 1

attr(m2$terms,"intercept")
[1] 0

I do not advise doing this, because you can easily use the formula interface to fit the model, without providing the model matrix yourself. But you can see by changing the attribute, you can get summary.lm to use the correct rss and get the correct r-squared:

attr(m2$terms,"intercept") = 1
Call:
lm(formula = y ~ 0   X)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

CodePudding user response：

StupidWolf gave you the answer. You are estimating two different regression models.

Because your second model specification m2 <- lm(y ~ 0 X). You are not estimating an intercept and you have a extra variable variable X(intercept).

To get the same R^2 just correct the model

m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)

X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ X)
summary(m2)

Gives you the same R^2 since you regress the same model.