I have a large dataset and ran a multiple regression with a large number of (but not all) available variables. I am trying to run a simple regression for comparison and need to use the same observations as those in the multiple regression.
What is the easiest/best way to do this? I was thinking I could create a subset containing just complete observations on the variables in the multiple regression and run both the multiple regression and simple regression on that subset, but I can't figure out how to do that.
Perhaps there is an even easier way to just identify and 'select' the observations used in the multiple regression?
I have done some extensive googling on the subject but can't find a solution so far.
Answer:
You can accomplish this with the function model.frame(). See ?model.frame.
model.frame (a generic function) and its methods return a data.frame with the variables needed to use formula and any ... arguments.
library(dplyr)
data(storms)
nrow(storms) # pretty big
#> [1] 11859
# multiple regression
fit1 <- lm(pressure ~ wind + year + month + hurricane_force_diameter, data = storms)
df_used_in_fit1 <- model.frame(fit1) %>% as.data.frame()
nrow(df_used_in_fit1) # smaller because of NA values
#> [1] 5350
# simpler regression
fit2 <- lm(pressure ~ wind, data = df_used_in_fit1)
nrow(model.frame(fit2))
#> [1] 5350
Note that model.frame will only include the variables that were used in the original lm model.
names(df_used_in_fit1)
#> [1] "pressure" "wind" "year"
#> [4] "month" "hurricane_force_diameter"
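If you also need columns that were not in the original model, one option is to subset the original data to the rows the model kept, rather than reusing the model frame. Below is a minimal sketch of two ways to do that. It uses the built-in airquality data instead of storms so that it runs without extra packages; the object names (fit_full, aq_used, aq_cc, fit_simple) are made up for illustration. It assumes the model was fitted with the default na.action (na.omit) and without subset or weights arguments, and that the row names of the data are unique.

```r
# airquality ships with base R and has NA values in Ozone and Solar.R
fit_full <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Option 1: the model frame keeps the row names of the observations
# that survived NA removal, so use them to index the original data
used_rows <- rownames(model.frame(fit_full))
aq_used <- airquality[used_rows, ]

# Option 2: keep the complete cases on the variables named in the formula
vars <- all.vars(formula(fit_full))
aq_cc <- airquality[complete.cases(airquality[, vars]), ]

# Both approaches select the same rows, and every original column is kept
identical(rownames(aq_used), rownames(aq_cc))
names(aq_used)

# Simple regression on exactly the same observations
fit_simple <- lm(Ozone ~ Solar.R, data = aq_used)
nobs(fit_full) == nobs(fit_simple)
```

Option 1 follows whatever the fitted model actually used, so it is probably the safer choice. Option 2 is closer to the "subset of complete observations" idea from the question and does not require fitting the multiple regression first.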