Identifying and 'capturing' observations used in a multiple regression

I have a large dataset and ran a multiple regression with a large number of (but not all) available variables. I am trying to run a simple regression for comparison and need to use the same observations as those in the multiple regression.

What is the easiest/best way to do this? I was thinking I could create a subset containing just complete observations on the variables in the multiple regression and run both the multiple regression and simple regression on that subset, but I can't figure out how to do that.

Perhaps there is an even easier way to just identify and 'select' the observations used in the multiple regression?

I have done some extensive googling on the subject but can't find a solution so far.

CodePudding user response：

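You can accomplish this with model.frame(). Called on a fitted lm object, it returns the data that were actually used in the fit, so rows dropped because of missing values in any model variable (the default na.action) are already excluded. See ?model.frame.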

model.frame (a generic function) and its methods return a data.frame with the variables needed to use formula and any ... arguments.

library(dplyr)
data(storms)
nrow(storms) # pretty big
#> [1] 11859

# multiple regression
fit1 <- lm(pressure ~ wind + year + month + hurricane_force_diameter, data = storms)
df_used_in_fit1 <- model.frame(fit1) %>% as.data.frame()
nrow(df_used_in_fit1) # smaller because of NA values
#> [1] 5350

# simpler regression
fit2 <- lm(pressure ~ wind, data = df_used_in_fit1)
nrow(model.frame(fit2))
#> [1] 5350

Note that model.frame() only includes the variables used in the original lm() call, so any other columns of storms are not available in df_used_in_fit1.

names(df_used_in_fit1)
#> [1] "pressure"                 "wind"                     "year"                    
#> [4] "month"                    "hurricane_force_diameter"
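
If the comparison regression needs columns that were not in the multiple regression, the model frame alone is not enough. Below is a minimal sketch of two ways to pull the full rows instead. It uses the built-in airquality data rather than storms so that it is self-contained, and the object names (fit_multi, subset1, subset2, fit_simple) are only illustrative. The complete.cases() route assumes the formula contains plain variables (no transformations that could create new NA values) and that no subset or weights argument was passed to lm().

```r
# Built-in data frame with missing values in Ozone and Solar.R
data(airquality)

# Multiple regression; lm() silently drops rows with NA in any model variable
fit_multi <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Option 1: the row names of the model frame identify the rows lm() kept
used_rows <- rownames(model.frame(fit_multi))
subset1 <- airquality[used_rows, ]

# Option 2: keep rows that are complete on the variables named in the formula
vars <- all.vars(formula(fit_multi))
subset2 <- airquality[complete.cases(airquality[, vars]), ]

# Both subsets keep every column, including Month and Day,
# which were not part of the model
identical(rownames(subset1), rownames(subset2))

# Simple regression on exactly the same observations
fit_simple <- lm(Ozone ~ Wind, data = subset1)
nobs(fit_simple) == nobs(fit_multi)
```

Character row indexing as in Option 1 works for a plain data.frame. For a tibble such as storms, I would expect to need integer positions instead, e.g. storms[as.integer(rownames(model.frame(fit1))), ], but treat that as an assumption and check the row count against nobs(fit1).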