Automatically removing variables from dataframe based on VIF criteria using R

Time:12-01

I have a series of data frames, each holding the data for one linear model. I want to automatically remove columns from each data frame whose variance inflation factor (VIF) exceeds a threshold of 10. One of the data frames looks like this:

df_nn <- structure(list(capital = c(100, 101, 102, 103, 
104, 105, 106, 107, 108, 109, 
110, 111, 112, 113, 114, 115, 
116, 117, 118, 119, 120, 121, 
122, 123, 124, 125, 126, 127, 
128, 129, 130, 131, 132), IVAE = c(109.19, 
110.09, 111.84, 112.49, 111.99, 113.11, 111.89, 112.11, 112.75, 
113.7, 112.93, 112.43, 114.88, 114.5, 114.93, 115.13, 105.54, 
91.71, 87.93, 93.06, 96.74, 103.26, 106.76, 109.6, 110.74, 112, 
112.73, 114.97, 115.01, 114.67, 115.78, 114.52, 111.91), `Índice de Producción Industrial (IPI): Industrias Manufactureras, Explotación de Minas y Canteras y Otras Actividades Industriales` = c(101.4, 
103.4, 106.72, 108.45, 107.76, 107.25, 105.75, 107.03, 107.31, 
106.61, 106.95, 106.61, 110.18, 108.68, 109.66, 111.32, 100.02, 
76.77, 73.46, 81.99, 94.83, 100.64, 104.51, 106.74, 107.04, 108.75, 
110.8, 110.59, 111.25, 108.82, 110.03, 111.32, 107.61), Construcción = c(112.25, 
117.5, 124.32, 122.64, 121.21, 128.69, 122.28, 126.55, 120.13, 
137.47, 129.82, 126.83, 132.92, 131.72, 137.56, 130.89, 117.08, 
87.62, 67.49, 79.56, 88.97, 117.57, 110.01, 118.02, 117.61, 121.64, 
120.76, 120.99, 118.96, 122.7, 122.59, 101.2, 106.3), `Comercio, Transporte y Almacenamiento, Actividades de Alojamiento y de Servicio de Comidas` = c(112.2, 
113.03, 115.69, 113.74, 114.7, 115.93, 115.3, 114.25, 115.05, 
116.68, 114.84, 114.56, 116.58, 117.77, 119.19, 119.15, 103.41, 
76.66, 75.21, 90.32, 91.72, 97.53, 105.21, 110.43, 109.72, 112.41, 
114.05, 115.88, 117.29, 115.05, 114.69, 116.79, 109.68), `Actividades Inmobiliarias` = c(113.31, 
113.83, 114.69, 114.97, 115.98, 116.2, 116.22, 115.64, 115.79, 
115.95, 116.24, 117.6, 117.84, 115.35, 108.98, 105.89, 103.74, 
103.16, 102.5, 102.42, 102.41, 104.16, 107.74, 112.87, 116.57, 
115.68, 113.47, 112.41, 112.08, 112.42, 112.74, 113.21, 112.56
), `Actividades Profesionales, Científicas, Técnicas, Administrativas, de Apoyo y Otros Servicios` = c(111.84, 
111.92, 116.44, 117.77, 112.96, 114.64, 113.67, 112.33, 115.12, 
113.31, 114.14, 115.46, 117.17, 120.57, 124.26, 122.68, 99.51, 
86.36, 79.21, 81.56, 83.6, 88.71, 97.76, 98.16, 101.04, 102.68, 
108.37, 113.64, 114.82, 115.91, 118.35, 118.74, 109.14), empleo = c(851413, 
856079, 853309, 854541, 856040, 853881, 853328, 858454, 860200, 
861430, 865033, 867569, 874276, 870793, 872645, 876928, 873733, 
840029, 813159, 805474, 808920, 814118, 824284, 833293, 841311, 
842072, 848832, 854290, 859130, 860833, 865704, 873081, 881033
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))

Here, "capital" is the dependent variable and the remaining columns, all numeric, are the independent variables.

So far, I have tried the following function for a single data frame:

library(car)

vif_fun <- function(df){
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    if (max(vifs) < 10) {
      break
    }
    highest <- c(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
  }
  return(df)
}

vif_fun(df_nn)

As long as there is a variable with a VIF above 10, the function should:

  • Identify the variable with the maximum VIF
  • Remove it from the data frame
  • Repeat until there are no more variables with a VIF above 10

However, whenever I run the function, I get the following error message:

Error in terms.formula(formula, data = data) : 
'.' in formula and no 'data' argument

I tried the function with the mtcars data set, replacing "capital" with "mpg" in the function, and it worked. Any ideas about what might be going on?

CodePudding user response:

An easier option is to use clean_names() from the janitor package, which replaces the non-syntactic column names with syntactic ones:

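vif_fun <- function(df){
  # Make all column names syntactic so they match names(vifs) exactly
  df <- janitor::clean_names(df)
  while(TRUE) {
    vifs <- vif(lm(capital ~ ., data = df))
    # Stop once every remaining predictor is below the threshold
    if (max(vifs) < 10) {
      break
    }
    # Drop the predictor(s) with the highest VIF, then refit
    highest <- names(which(vifs == max(vifs)))
    df <- df[, !(names(df) %in% highest)]
  }
  return(df)
}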

vif_fun(df_nn)
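
If installing janitor is not an option, base R's make.names() can serve a similar purpose. The following is only a sketch: syntactic_names() and the toy data frame are hypothetical names made up for illustration, and the resulting names use dots rather than janitor's snake_case style.

```r
# Hypothetical helper: make every column name syntactic using base R only.
syntactic_names <- function(df) {
  names(df) <- make.names(names(df), unique = TRUE)
  df
}

# Made-up example with one column name that contains a space.
toy <- data.frame(
  capital = 1:3,
  `Actividades Inmobiliarias` = c(2, 4, 5),
  check.names = FALSE
)

names(syntactic_names(toy))
# "capital" "Actividades.Inmobiliarias"
```

Calling this helper in place of janitor::clean_names() inside vif_fun() should have the same effect on the name matching, although the returned data frame will carry the dotted names.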

CodePudding user response:

The problem is that your data frame has non-syntactic column names (some of them contain spaces and commas). Because of this, the names of the object returned by vif() no longer match the column names exactly: vif() wraps non-syntactic names in backticks, but those backticks are not part of the column names in the data frame. When nothing matches, which() returns an empty vector, and indexing with a negated empty vector selects zero columns, so the next call to lm() has no columns left to expand the '.' with, which appears to be what triggers the error. You can strip the backticks before matching, for example:

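vif_fun <- function(df){
  # Strip the leading and trailing backtick that appear in names(vifs)
  untick <- function(x) gsub("^`|`$", "", x)
  while(TRUE) {
    vifs <- vif(lm(capital ~ ., data = df))
    # Stop once every remaining predictor is below the threshold
    if (max(vifs) < 10) {
      break
    }
    highest <- untick(names(which(vifs == max(vifs))))
    # A logical mask keeps every column when nothing matches, unlike -which()
    df <- df[, !(names(df) %in% highest)]
  }
  return(df)
}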
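
To see the mismatch in isolation, here is a minimal sketch on a made-up data frame (toy, y, `x one` and x2 are illustrative names, not taken from the question). It uses only base R and reads the term labels from the fitted model, on the assumption that this is where the backticked names reported by vif() come from.

```r
# Toy data with one non-syntactic column name (it contains a space).
set.seed(1)
toy <- data.frame(
  y = rnorm(20),
  `x one` = rnorm(20),
  x2 = rnorm(20),
  check.names = FALSE
)

fit <- lm(y ~ ., data = toy)

# The term labels of the model carry backticks ...
term_names <- attr(terms(fit), "term.labels")
term_names
# "`x one`" "x2"

# ... while the column names of the data frame do not.
names(toy)
# "y" "x one" "x2"

# No column matches the backticked label, so which() returns integer(0).
# Negating an empty index is still an empty index: zero columns are selected.
idx <- which(names(toy) %in% term_names[1])
ncol(toy[, -idx, drop = FALSE])
# 0

# Stripping the backticks and using a logical mask removes only the match.
untick <- function(x) gsub("^`|`$", "", x)
keep <- !(names(toy) %in% untick(term_names[1]))
names(toy[, keep, drop = FALSE])
# "y" "x2"
```

The logical mask is a design choice rather than a requirement: -which() also works once the names match, but it silently drops every column whenever the match comes back empty.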