I am fitting the same linear model in two different ways, resulting in the same parameter estimates but differing R-squared values. Where does the difference come from? Is this a bug in R? Here is my code:
m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)
X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ 0 X)
summary(m2)
The output for m1
is the following (slightly shortened):
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197 11.8960 -3.356 0.00375 **
Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
Water.Temp 1.2953 0.3680 3.520 0.00263 **
Acid.Conc. -0.1521 0.1563 -0.973 0.34405
Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
The output for m2
is has the same estimates for coefficients and residual standard error, but different R-squared values and different F-statistic:
Estimate Std. Error t value Pr(>|t|)
X(Intercept) -39.9197 11.8960 -3.356 0.00375 **
XAir.Flow 0.7156 0.1349 5.307 5.8e-05 ***
XWater.Temp 1.2953 0.3680 3.520 0.00263 **
XAcid.Conc. -0.1521 0.1563 -0.973 0.34405
Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.979, Adjusted R-squared: 0.9741
F-statistic: 198.2 on 4 and 17 DF, p-value: 5.098e-14
Why are the R-squared values different?
CodePudding user response:
This is discussed in this post and also this. Here's a break down of what is happening in the lm() source code. The relevant part:
r <- z$residuals
f <- z$fitted.values
w <- z$weights
if (is.null(w)) {
mss <- if (attr(z$terms, "intercept"))
sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
}
Although you included an intercept, the attributes of the terms are not set to include an intercept, compare:
attr(m1$terms,"intercept")
[1] 1
attr(m2$terms,"intercept")
[1] 0
I do not advise doing this, because you can easily use the formula interface to fit the model, without providing the model matrix yourself. But you can see by changing the attribute, you can get summary.lm
to use the correct rss and get the correct r-squared:
attr(m2$terms,"intercept") = 1
Call:
lm(formula = y ~ 0 X)
Residuals:
Min 1Q Median 3Q Max
-7.2377 -1.7117 -0.4551 2.3614 5.6978
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X(Intercept) -39.9197 11.8960 -3.356 0.00375 **
XAir.Flow 0.7156 0.1349 5.307 5.8e-05 ***
XWater.Temp 1.2953 0.3680 3.520 0.00263 **
XAcid.Conc. -0.1521 0.1563 -0.973 0.34405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
CodePudding user response:
StupidWolf gave you the answer. You are estimating two different regression models.
Because your second model specification m2 <- lm(y ~ 0 X). You are not estimating an intercept and you have a extra variable variable X(intercept).
To get the same R^2 just correct the model
m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)
X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ X)
summary(m2)
Gives you the same R^2 since you regress the same model.