I am working on panel regression and decided to switch to lm(), because plm() does not have a good predict() function for test data (nor does linearmodels in Python), and the lme4 syntax is not intuitive for me as a newbie to econometrics. I want to use lm and predict on unseen data.
My lm() equation looks like this, with author as a fixed effect:
fit <- lm(y ~ a + b + factor(author), data = train)
Obviously it prints thousands of individual coefficients. How can I construct a model in lm() that evaluates all authors without printing them individually?
CodePudding user response:
It sounds like you want to test the "overall" effect of including author in the model, rather than testing differences among all the individual authors. You can get at this by fitting separate models with and without the author term and comparing them. Here's a demonstration, using manufacturer from ggplot2::mpg as a stand-in for your author variable.
data(mpg, package = "ggplot2")
# base model
m1 <- lm(cty ~ displ, data = mpg)
# model with multicategorical predictor
m2 <- lm(cty ~ displ + manufacturer, data = mpg)
# model comparison - test of effect of including `manufacturer`
(m2_v_m1 <- anova(m1, m2))
#> Analysis of Variance Table
#>
#> Model 1: cty ~ displ
#> Model 2: cty ~ displ + manufacturer
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    232 1529.3
#> 2    218 1121.1 14    408.15 5.6688 2.393e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-03-11 by the reprex package (v2.0.1)
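If you need the result of that comparison programmatically (say, to report it or to loop over many such tests), note that the object anova() returns here is a data frame, so you can pull out the F statistic and p-value directly; a minimal sketch:
# row 2 of the comparison table holds the test of m2 against m1
f_stat <- m2_v_m1[["F"]][2]        # F statistic for adding manufacturer
p_val  <- m2_v_m1[["Pr(>F)"]][2]   # corresponding p-value
c(F = f_stat, p = p_val)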
CodePudding user response:
Let's assume some concrete example data, e.g.
a <- rnorm(100)
b <- runif(100)
train <- data.frame(a, b,
                    author = sample(LETTERS[1:10], 100, replace = TRUE),
                    y = 3*a + .5*b + rnorm(100))
Now we do a fixed-effects regression. I assume we do not want an intercept, so the command is
fit <- lm(y ~ a + b + author - 1, data = train)
The - 1 part of the formula leaves the intercept out; instead, a fixed effect is computed for each author, with no base level left out.
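To see what the - 1 actually changes, here is a small contrast, as a sketch on the simulated train data above (fit_base is just an illustrative name): with the default intercept one author level is absorbed into the baseline, while dropping the intercept gives every author its own coefficient.
# default parameterisation: author "A" is absorbed into the intercept
fit_base <- lm(y ~ a + b + author, data = train)
head(names(coef(fit_base)))  # "(Intercept)" "a" "b" "authorB" "authorC" "authorD"
# with "- 1": no intercept, one coefficient per author
head(names(coef(fit)))       # "a" "b" "authorA" "authorB" "authorC" "authorD"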
Printing this model or its summary is feasible with the 10 authors in the example, but not with the thousands in your data.
You can print the coefficients of only a and b like this:
> fit$coefficients[c('a', 'b')]
a b
3.02022335 0.09789947
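If there are more covariates than just a and b and you do not want to list them by hand, one option (a sketch; it assumes the author dummies are all named authorX, as lm names them here) is to drop every coefficient whose name starts with "author":
# keep every coefficient except the author fixed effects
coefs <- coef(fit)
coefs[!grepl("^author", names(coefs))]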
You can see the model terms and their p-values via the anova command:
> anova(fit)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq   F value    Pr(>F)
a          1 1139.90 1139.90 1034.2307 < 2.2e-16 ***
b          1    7.73    7.73    7.0127  0.009591 **
author    10   10.75    1.07    0.9751  0.470812
Residuals 88   96.99    1.10
And you can even deconstruct summary(fit) to get the coefficients or to display the adjusted R²:
> summary(fit)$coefficients[c("a", "b"),]
Estimate Std. Error t value Pr(>|t|)
a 3.02022335 0.09697699 31.1437123 2.679195e-49
b 0.09789947 0.35161039 0.2784317 7.813342e-01
>
> summary(fit)$adj.r.squared
[1] 0.9122033
For other values, see help(summary.lm). Should you still want to see the coefficient for author F, that is
> fit$coefficients["authorF"]
authorF
0.6174314
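Finally, since your original question mentions predicting on unseen data: once the model is an ordinary lm fit, predict() works as usual. A minimal sketch, where test is a hypothetical hold-out data frame with the same columns (note that every author in test must also appear in train, otherwise predict() errors on the unseen factor level):
# hypothetical hold-out set with the same structure as train
test <- data.frame(a = rnorm(20), b = runif(20),
                   author = sample(LETTERS[1:10], 20, replace = TRUE))
test$pred <- predict(fit, newdata = test)
head(test)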