Why do I get different values of R-squared for these two models, which should be equivalent (in the second model, the intercept term is replaced by a level of z)? Is this a bug or am I missing something?
set.seed(42)
N=100
# intercepts
iA = 3
iB = 3.5
# slopes
sA = 1.5
sB = 0.5
# xs
xA = runif(n=N, min=0, max=1)
xB = runif(n=N, min=0, max=1)
# ys
yA = sA*xA + iA + rnorm(n=N)/10
yB = sB*xB + iB + rnorm(n=N)/10
data = data.frame(x=c(xA, xB), y=c(yA, yB), z=c(rep("A", times=N), rep("B", times=N)))
lm1 = lm(data=data, formula = y ~ x + z)
lm2 = lm(data=data, formula = y ~ x + z - 1)
coef(lm1)
coef(lm2)
summary(lm1)$r.squared
summary(lm2)$r.squared
Output:
> coef(lm1)
(Intercept) x zB
3.23590275 1.03353472 -0.01435266
> coef(lm2)
x zA zB
1.033535 3.235903 3.221550
>
> summary(lm1)$r.squared
[1] 0.7552991
> summary(lm2)$r.squared
[1] 0.9979477
CodePudding user response:
For models with an intercept, summary.lm calculates an R^2 that compares the model to the intercept-only model. For a model without an intercept that comparison does not make sense, so it compares the model to the zero model instead. Of course, in your example the intercept is actually a linear combination of columns of the model matrix, i.e.

all(model.matrix(lm2) %*% c(0, 1, 1) == 1)

is TRUE, so it would be possible to write software that checks whether the intercept-only model is a submodel of the full model; but as it is, summary.lm only looks at whether the model formula specifies an intercept or not.
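That the two fits really are the same model in different parameterizations can be checked directly: the fitted values and residuals coincide, so only the baseline used in the R^2 calculation differs. A self-contained sketch, re-running the question's simulation:

```r
set.seed(42)
N <- 100
xA <- runif(N); xB <- runif(N)
yA <- 1.5 * xA + 3   + rnorm(N) / 10
yB <- 0.5 * xB + 3.5 + rnorm(N) / 10
data <- data.frame(x = c(xA, xB), y = c(yA, yB),
                   z = rep(c("A", "B"), each = N))
lm1 <- lm(y ~ x + z, data = data)
lm2 <- lm(y ~ x + z - 1, data = data)

# Same column space, hence identical fits: only summary.lm's
# choice of baseline for R^2 differs between the two.
all.equal(fitted(lm1), fitted(lm2))
## [1] TRUE
all.equal(resid(lm1), resid(lm2))
## [1] TRUE
```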
In terms of calculations, for models with an intercept summary.lm uses the equivalent of
1 - sum(resid(lm1)^2) / sum((data$y - mean(data$y))^2)
## [1] 0.7552991
1 - sum(resid(lm2)^2) / sum((data$y - mean(data$y))^2)
## [1] 0.7552991
but for models without an intercept summary.lm drops the mean term:
1 - sum(resid(lm2)^2) / sum(data$y^2)
## [1] 0.9979477
You can compare these to
summary(lm1)$r.squared
## [1] 0.7552991
summary(lm2)$r.squared
## [1] 0.9979477
See ?summary.lm, where this is mentioned.
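As a further cross-check (a standard OLS identity, not something summary.lm uses): for any fit whose column space contains the intercept, the squared correlation between y and the fitted values equals the intercept-based R^2. Since lm2's fitted values match lm1's, this recovers lm1's R^2 even from the no-intercept fit. A sketch re-running the question's simulation:

```r
set.seed(42)
N <- 100
xA <- runif(N); xB <- runif(N)
yA <- 1.5 * xA + 3   + rnorm(N) / 10
yB <- 0.5 * xB + 3.5 + rnorm(N) / 10
data <- data.frame(x = c(xA, xB), y = c(yA, yB),
                   z = rep(c("A", "B"), each = N))
lm2 <- lm(y ~ x + z - 1, data = data)

# Squared correlation of y with the fitted values gives the
# intercept-based R^2, regardless of the parameterization.
cor(data$y, fitted(lm2))^2
## [1] 0.7552991
```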
CodePudding user response:
From help("summary.lm")
(emphasis added):
R², the ‘fraction of variance explained by the model’,

R^2 = 1 - Sum_i(R_i^2) / Sum_i((y_i - y*)^2),

where y* is the mean of y_i if there is an intercept and zero otherwise.
If you remove the intercept, R² is defined differently (which is sensible from the perspective of a statistician).
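Plugging lm2's residuals into that formula under each choice of y* reproduces both numbers from the question. A self-contained sketch, re-running the question's simulation:

```r
set.seed(42)
N <- 100
xA <- runif(N); xB <- runif(N)
yA <- 1.5 * xA + 3   + rnorm(N) / 10
yB <- 0.5 * xB + 3.5 + rnorm(N) / 10
data <- data.frame(x = c(xA, xB), y = c(yA, yB),
                   z = rep(c("A", "B"), each = N))
lm2 <- lm(y ~ x + z - 1, data = data)

rss <- sum(resid(lm2)^2)
1 - rss / sum((data$y - mean(data$y))^2)  # y* = mean(y): intercept definition
## [1] 0.7552991
1 - rss / sum(data$y^2)                   # y* = 0: no-intercept definition
## [1] 0.9979477
```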