Automatically removing variables from a data frame based on VIF criteria using R

I have a series of data frames, each holding the variables for one linear model. I want to automatically drop columns from each data frame until no remaining variable has a variance inflation factor (VIF) of 10 or more. A given data frame looks like this:

df_nn <- structure(list(capital = c(100, 101, 102, 103, 
104, 105, 106, 107, 108, 109, 
110, 111, 112, 113, 114, 115, 
116, 117, 118, 119, 120, 121, 
122, 123, 124, 125, 126, 127, 
128, 129, 130, 131, 132), IVAE = c(109.19, 
110.09, 111.84, 112.49, 111.99, 113.11, 111.89, 112.11, 112.75, 
113.7, 112.93, 112.43, 114.88, 114.5, 114.93, 115.13, 105.54, 
91.71, 87.93, 93.06, 96.74, 103.26, 106.76, 109.6, 110.74, 112, 
112.73, 114.97, 115.01, 114.67, 115.78, 114.52, 111.91), `Índice de Producción Industrial (IPI): Industrias Manufactureras, Explotación de Minas y Canteras y Otras Actividades Industriales` = c(101.4, 
103.4, 106.72, 108.45, 107.76, 107.25, 105.75, 107.03, 107.31, 
106.61, 106.95, 106.61, 110.18, 108.68, 109.66, 111.32, 100.02, 
76.77, 73.46, 81.99, 94.83, 100.64, 104.51, 106.74, 107.04, 108.75, 
110.8, 110.59, 111.25, 108.82, 110.03, 111.32, 107.61), Construcción = c(112.25, 
117.5, 124.32, 122.64, 121.21, 128.69, 122.28, 126.55, 120.13, 
137.47, 129.82, 126.83, 132.92, 131.72, 137.56, 130.89, 117.08, 
87.62, 67.49, 79.56, 88.97, 117.57, 110.01, 118.02, 117.61, 121.64, 
120.76, 120.99, 118.96, 122.7, 122.59, 101.2, 106.3), `Comercio, Transporte y Almacenamiento, Actividades de Alojamiento y de Servicio de Comidas` = c(112.2, 
113.03, 115.69, 113.74, 114.7, 115.93, 115.3, 114.25, 115.05, 
116.68, 114.84, 114.56, 116.58, 117.77, 119.19, 119.15, 103.41, 
76.66, 75.21, 90.32, 91.72, 97.53, 105.21, 110.43, 109.72, 112.41, 
114.05, 115.88, 117.29, 115.05, 114.69, 116.79, 109.68), `Actividades Inmobiliarias` = c(113.31, 
113.83, 114.69, 114.97, 115.98, 116.2, 116.22, 115.64, 115.79, 
115.95, 116.24, 117.6, 117.84, 115.35, 108.98, 105.89, 103.74, 
103.16, 102.5, 102.42, 102.41, 104.16, 107.74, 112.87, 116.57, 
115.68, 113.47, 112.41, 112.08, 112.42, 112.74, 113.21, 112.56
), `Actividades Profesionales, Científicas, Técnicas, Administrativas, de Apoyo y Otros Servicios` = c(111.84, 
111.92, 116.44, 117.77, 112.96, 114.64, 113.67, 112.33, 115.12, 
113.31, 114.14, 115.46, 117.17, 120.57, 124.26, 122.68, 99.51, 
86.36, 79.21, 81.56, 83.6, 88.71, 97.76, 98.16, 101.04, 102.68, 
108.37, 113.64, 114.82, 115.91, 118.35, 118.74, 109.14), empleo = c(851413, 
856079, 853309, 854541, 856040, 853881, 853328, 858454, 860200, 
861430, 865033, 867569, 874276, 870793, 872645, 876928, 873733, 
840029, 813159, 805474, 808920, 814118, 824284, 833293, 841311, 
842072, 848832, 854290, 859130, 860833, 865704, 873081, 881033
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))

Here "capital" is the dependent variable and the remaining columns, all numeric, are the independent variables.

So far, I have tried the following function for a single data frame:

library(car)

vif_fun <- function(df){
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    if (max(vifs) < 10) {
      break
    }
    highest <- c(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
  }
  return(df)
}

vif_fun(df_nn)

As long as there is a variable with a VIF above 10, the function should:

1. Identify the variable with the maximum VIF
2. Remove it from the data frame
3. Repeat until there are no more variables with a VIF above 10

However, whenever I run the function, I get the following error message:

Error in terms.formula(formula, data = data) : 
'.' in formula and no 'data' argument

I tried the function with the mtcars data set, replacing "capital" with "mpg" in the function, and it worked. Any idea what might be going on?

CodePudding user response：

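An easier option is to use clean_names() from the janitor package, which replaces the non-syntactic column names (the ones containing spaces, commas, and other special characters) with syntactic ones. Note that the returned data frame keeps the cleaned names: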

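vif_fun <- function(df){
  # Make the column names syntactic so they match the names returned by vif()
  df <- janitor::clean_names(df)
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    # Stop once every remaining predictor has a VIF below 10
    if (max(vifs) < 10) {
      break
    }
    # Drop the predictor(s) with the highest VIF, then refit
    highest <- c(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
  }
  return(df)
}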

vif_fun(df_nn)
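
If you would rather not add a dependency, base R's make.names() should have a similar effect. The sketch below uses a small made-up data frame (the values and the shortened column set are assumptions for illustration only) and checks that, after renaming, the model's term labels match the column names again. As far as I can tell, those term labels are where car::vif() takes the names of its result from.

```r
# Toy data, invented for illustration; only the column names mimic the question.
toy <- data.frame(
  capital = c(1, 2, 3, 4, 5),
  `Actividades Inmobiliarias` = c(2, 1, 4, 3, 6),
  empleo = c(5, 3, 4, 1, 2),
  check.names = FALSE  # keep the space in the column name
)

# make.names() replaces characters that are not valid in a syntactic name with "."
names(toy) <- make.names(names(toy), unique = TRUE)
print(names(toy))
# names are now: capital, Actividades.Inmobiliarias, empleo

# After renaming, the model's term labels match the column names
fit <- lm(capital ~ ., data = toy)
labels_match <- all(labels(terms(fit)) %in% names(toy))
print(labels_match)  # TRUE
```

Unlike clean_names(), make.names() does not convert the names to lower-case snake_case; it only makes them syntactic.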

CodePudding user response：

The problem is that you have non-syntactic names in your data.frame (some of the column names contain spaces). This causes a problem because the names of the object returned by vif() no longer exactly match the column names. The vif function wraps the non-syntactic column names in backticks, but those backticks are not actually part of the column names in the data.frame, so the match finds nothing to remove. You can strip the backticks when doing the match, for example:

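vif_fun <- function(df){
  # Strip a leading and a trailing backtick from each name
  untick <- function(x) gsub("^`|`$", "", x)
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    # Stop once every remaining predictor has a VIF below 10
    if (max(vifs) < 10) {
      break
    }
    # Names from vif() may be backticked; untick them before matching columns
    highest <- untick(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
  }
  return(df)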
}
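
To see the mismatch without the full data set, here is a minimal sketch in base R. The toy data frame and its column names are made up for illustration, and labels(terms(fit)) is used as a stand-in for names(vif(fit)), on the assumption that vif() labels its result with the model's term labels.

```r
# Toy data, invented for illustration; "my var" is a deliberately non-syntactic name.
toy <- data.frame(
  y = c(1, 2, 3, 4, 5),
  `my var` = c(2, 1, 4, 3, 6),
  x2 = c(5, 3, 4, 1, 2),
  check.names = FALSE  # keep the space in the column name
)
fit <- lm(y ~ ., data = toy)

# Term labels wrap the non-syntactic name in backticks
term_labels <- labels(terms(fit))
print(term_labels)

# The backticked label matches no column, so which() returns an empty index ...
idx <- which(names(toy) %in% term_labels[1])
print(length(idx))        # 0

# ... and negative indexing with an empty vector keeps no columns at all
print(ncol(toy[, -idx]))  # 0

# Stripping the backticks first restores the match
untick <- function(x) gsub("^`|`$", "", x)
print(untick(term_labels[1]) %in% names(toy))  # TRUE
```

In the original function, that zero-column data frame is passed to lm() on the next iteration, which is most likely what triggers the "'.' in formula" error.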