I was reading this article and they used the following code to remove a variable/column from data:
data(airquality)
# using subset()
summary(lm(Ozone ~., data = subset(airquality, select = -Solar.R)))
# direct manipulation
summary(lm(Ozone ~. -Solar.R, data = airquality))
My initial thought was that both do the same thing by removing the variable Solar.R from the lm, but they seem to produce different results. What is the difference between the two approaches? Why do they create different regression results?
Answer:
In your formula, . refers to all the other variables in the data (other than Ozone). By then subtracting Solar.R from . (i.e. Ozone ~ . - Solar.R), you are effectively doing this:
lm(Ozone ~ Wind - Solar.R + Temp - Solar.R + Month - Solar.R + Day - Solar.R + Solar.R - Solar.R, data = airquality)
Notice that the result of this is the same as your second model.
So, in short, both of these approaches (your second model, and my written-out version of it) fit the same model as your first one, but cause more rows to drop out. In a formula, - removes a term from the model rather than subtracting values, but Solar.R is still a variable in the formula, so lm() keeps it in the model frame and the default na.action (na.omit) drops every row where it is missing. Since there are five additional rows in the dataset where Solar.R is missing (but Ozone is not missing), those five rows drop out, increasing the total dropped from 37 to 42.
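If it helps, here is a quick way to check the row counts yourself in base R. This is only a sketch; the object names f, mf, fit_subset and fit_formula are ones I made up for illustration.

```r
data(airquality)

# The "-" in the formula removes the Solar.R term, so four predictors remain
f <- Ozone ~ . - Solar.R
attr(terms(f, data = airquality), "term.labels")

# Solar.R is still a variable of the formula, so it appears in the model frame,
# and rows where it is NA are dropped by the default na.action
mf <- model.frame(f, data = airquality)
names(mf)

# Rows lost because Ozone is missing vs. rows lost because any column is missing
sum(is.na(airquality$Ozone))        # 37
sum(!complete.cases(airquality))    # 42

# The two fits therefore use a different number of observations
fit_subset  <- lm(Ozone ~ ., data = subset(airquality, select = -Solar.R))
fit_formula <- lm(f, data = airquality)
nobs(fit_subset)
nobs(fit_formula)
```

The difference between the two nobs() values should be the five rows where only Solar.R is missing.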
Notice that if Solar.R did not have any missing values (let's say I filled the missing values with the mean of the non-missing Solar.R values, as below, or with ANY value), then your first and second models would be identical. Specifically:
library(dplyr)  # provides select(), mutate(), if_else() and %>%
lm(Ozone ~ ., data = select(airquality, -Solar.R))
Call:
lm(formula = Ozone ~ ., data = select(airquality, -Solar.R))
Coefficients:
(Intercept)         Wind         Temp        Month          Day
   -70.1051      -3.0516       2.0984      -3.5209       0.2747
versus:
lm(Ozone ~ . - Solar.R, data =
  airquality %>%
    mutate(Solar.R = if_else(is.na(Solar.R), mean(Solar.R, na.rm = TRUE), as.double(Solar.R)))
)
Call:
lm(formula = Ozone ~ . - Solar.R, data = airquality %>% mutate(Solar.R = if_else(is.na(Solar.R),
    mean(Solar.R, na.rm = TRUE), as.double(Solar.R))))
Coefficients:
(Intercept)         Wind         Temp        Month          Day
   -70.1051      -3.0516       2.0984      -3.5209       0.2747
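If you would rather not depend on dplyr, the same comparison can be sketched in base R. The names filled, fit_subset and fit_filled are just placeholders I chose, and mean imputation is used here only to show the effect on dropped rows, not as a recommendation.

```r
data(airquality)

# First model: Solar.R is removed from the data before fitting
fit_subset <- lm(Ozone ~ ., data = subset(airquality, select = -Solar.R))

# Fill the missing Solar.R values so that no extra rows are dropped
filled <- airquality
filled$Solar.R[is.na(filled$Solar.R)] <- mean(filled$Solar.R, na.rm = TRUE)

# Second model: Solar.R is removed in the formula, fitted on the filled data
fit_filled <- lm(Ozone ~ . - Solar.R, data = filled)

# Compare the coefficients and the number of rows used by each fit
all.equal(coef(fit_subset), coef(fit_filled))
c(nobs(fit_subset), nobs(fit_filled))
```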