What is the difference between using subset() to remove a column and removing a variable directly wi

Time:07-16

I was reading this article and they used the following code to remove a variable/column from data:

data(airquality)

# using subset()
summary(lm(Ozone ~., data = subset(airquality, select = -Solar.R)))

# direct manipulation
summary(lm(Ozone ~. -Solar.R, data = airquality))

My initial thought was that both do the same thing by removing the variable Solar.R from the lm but they seem to produce different results. What is the difference between the two approaches? Why do they create different regression results?

CodePudding user response:

In your formula, . stands for every variable in the data other than Ozone. By then taking Solar.R out of . (i.e. Ozone ~ . - Solar.R), you are effectively doing this:

lm(Ozone~Wind-Solar.R + Temp-Solar.R + Month-Solar.R + Day-Solar.R + Solar.R-Solar.R, data=airquality)

Notice that the result of this is the same as your second model.

So, in short, both of these approaches (your second model, and my written-out version of it) fit the same model as your first one, but cause more rows to drop out. Inside a formula, - removes a term rather than subtracting values, but Solar.R is still named in the formula, so it is still part of the model frame, and lm() by default drops every row with a missing value in any model-frame variable. Since there are five additional rows in the dataset where Solar.R is missing (but Ozone is not missing), those five rows drop out, increasing the total dropped from 37 to 42.
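To check the row counts yourself, here is a minimal sketch in base R. The names m1 and m2 are just labels I picked for your first and second model, and the commented values follow from the counts above (153 rows in airquality, 37 or 42 dropped).

```r
data(airquality)

# Rows with a missing response are dropped by both models
sum(is.na(airquality$Ozone))                               # 37

# Rows where Solar.R is missing but Ozone is observed:
# only the second model loses these
sum(is.na(airquality$Solar.R) & !is.na(airquality$Ozone))  # 5

# m1 / m2 are illustrative names for the two models in the question
m1 <- lm(Ozone ~ ., data = subset(airquality, select = -Solar.R))
m2 <- lm(Ozone ~ . - Solar.R, data = airquality)

# Number of observations actually used in each fit
nobs(m1)  # 116 (153 - 37)
nobs(m2)  # 111 (153 - 42)

# Solar.R is not a predictor in m2, but it is still in the model frame,
# which is why its missing values cost rows
"Solar.R" %in% names(model.frame(m2))
"Solar.R" %in% names(coef(m2))
```

The last two lines are the quickest way to see the distinction: a variable can sit in the model frame (and trigger row deletion) without ever becoming a term in the fit.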

Notice that if Solar.R did not have any missing values (let's say I filled the missing values with the mean of the non-missing Solar.R values, as below, or with any other value), then your first and second models would be identical. Specifically, using dplyr:

library(dplyr)  # provides select(), mutate(), if_else() and %>%
lm(Ozone~., data=select(airquality,-Solar.R))

Call:
lm(formula = Ozone ~ ., data = select(airquality, -Solar.R))

Coefficients:
(Intercept)         Wind         Temp        Month          Day  
   -70.1051      -3.0516       2.0984      -3.5209       0.2747 

versus:

lm(Ozone~.-Solar.R, data=
     airquality %>% 
     mutate(Solar.R = if_else(is.na(Solar.R), mean(Solar.R, na.rm=T), as.double(Solar.R)))
   )

Call:
lm(formula = Ozone ~ . - Solar.R, data = airquality %>% mutate(Solar.R = if_else(is.na(Solar.R), 
    mean(Solar.R, na.rm = T), as.double(Solar.R))))

Coefficients:
(Intercept)         Wind         Temp        Month          Day  
   -70.1051      -3.0516       2.0984      -3.5209       0.2747
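
If you would rather not load dplyr, the same comparison can be sketched in base R. The names aq_filled, m1 and m2_filled are made up for this illustration; the mean fill is the same one used above, and any other fill value should work equally well because Solar.R never enters the fit as a predictor.

```r
data(airquality)

# First model: Solar.R removed from the data before fitting
m1 <- lm(Ozone ~ ., data = subset(airquality, select = -Solar.R))

# Fill the missing Solar.R values with the mean of the observed ones
# (aq_filled is a hypothetical name for the patched copy)
aq_filled <- airquality
aq_filled$Solar.R[is.na(aq_filled$Solar.R)] <- mean(airquality$Solar.R, na.rm = TRUE)

# Second model, fitted on the patched data
m2_filled <- lm(Ozone ~ . - Solar.R, data = aq_filled)

# Both fits should now use the same rows and give the same coefficients
nobs(m1) == nobs(m2_filled)
all.equal(coef(m1), coef(m2_filled))
```

Both comparisons should come back TRUE, matching the identical coefficient tables printed above.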