Looping linear regression in R for specific columns in dataset

I have the following dataset:

df <- data.frame(row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
     level = c(1000,2000,3000,4000,5000,6000,7000,8000,9000,10000),
     col1 = c(1,0,1,1,1,0,0,1,1,0),
     col2 = c(1,1,1,0,0,1,1,1,0,0),
     col3 = c(0,0,1,0,0,1,1,1,1,0),
     col4 = c(1,1,1,0,0,1,0,1,1,1),
     col5 = c(1,1,1,0,1,0,1,0,0,1))

I would like to regress the variable level on each of the other columns with the prefix col, one column at a time. I would like to use a for loop to do this instead of writing out each model like this:

lm1<-lm(level~col1, data=df)
lm2<-lm(level~col2, data=df)
lm3<-lm(level~col3, data=df)
lm4<-lm(level~col4, data=df)
lm5<-lm(level~col5, data=df)

Any help would be much appreciated, thanks!

CodePudding user response：

First we need a way to create a formula given the variable we picked. One way to do this:

as.formula(paste0("level ~", var))

where var is a variable like "col1".
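As a quick check of that idea, here is a minimal sketch that builds the formula from a string and compares it with base R's reformulate(), which does the same job without pasting. The names f_paste and f_reform are only illustrative.

```r
# Hypothetical predictor name; any column of the data frame would do
var <- "col1"

# Build the formula by pasting the response and predictor together
f_paste <- as.formula(paste0("level ~ ", var))

# reformulate() builds the same formula from the term label and response
f_reform <- reformulate(var, response = "level")

print(f_paste)
print(f_reform)
```

Both objects have class "formula" and can be passed directly to lm().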

Now we just need to create the loop for each model. If you want to do this using for loops, you can do something like this:

models = list()
# Create a vector of the explanatory variables
variables = setdiff(names(df), c("row_id", "level"))

for (var in variables) {
  models[[var]] = lm(
    as.formula(paste0("level ~ ", var)),
    data = df
  )
}

models is a list containing each model. For example, the model that uses col3 is available as models$col3:

> summary(models$col3)

Call:
lm(formula = as.formula(paste0("level ~ ", var)), data = df)

Residuals:
   Min     1Q Median     3Q    Max 
 -3600  -1950      0   1200   5600 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)     4400       1327   3.317   0.0106 *
col3            2200       1876   1.173   0.2747  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2966 on 8 degrees of freedom
Multiple R-squared:  0.1467,    Adjusted R-squared:   0.04 
F-statistic: 1.375 on 1 and 8 DF,  p-value: 0.2747

There are lots of improvements you can make to this approach as the requirements become more complicated, but this is a good start.
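One possible improvement, as a sketch rather than the only way: the Call line in the summary shows as.formula(paste0("level ~ ", var)) instead of the formula that was actually fitted. Splicing the formula into the call with bquote() before evaluating it is one common way around that. The variable f is introduced here for illustration.

```r
df <- data.frame(
  row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
  level  = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
  col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
  col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)

variables <- setdiff(names(df), c("row_id", "level"))
models <- list()

for (var in variables) {
  f <- as.formula(paste0("level ~ ", var))
  # bquote() substitutes the formula object for .(f) before lm() is called,
  # so the stored call records the formula itself, not the paste0() expression
  models[[var]] <- eval(bquote(lm(.(f), data = df)))
}

# The recorded call now names the predictor explicitly
models$col3$call
```

The fitted coefficients are the same as in the loop above; only the recorded call changes.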

CodePudding user response：

If you are only interested in the coefficients, you can get them from a single model by reshaping the data, with no for loop. The coefficients will match the separate regressions. The standard errors will not, because the single model estimates one pooled residual variance across all columns:

lm(level~ind/values-1,cbind(df[1:2], stack(df, -(1:2))))

Call:
  lm(formula = level ~ ind/values - 1, data = cbind(df[1:2], stack(df, 
    -(1:2)))) 
Coefficients:
       indcol1         indcol2         indcol3         indcol4  
        6250.0          7000.0          4400.0          5333.3  
       indcol5  indcol1:values  indcol2:values  indcol3:values  
        6750.0         -1250.0         -2500.0          2200.0  
indcol4:values  indcol5:values  
         238.1         -2083.3

The coefficients read as follows: indcol1 is the intercept for col1, while indcol1:values is its slope.

Compare this with the results from the separate models.
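To make that comparison concrete, here is a rough sketch that fits the single stacked model and the five separate models, then checks that the slopes agree. It builds the long data with rep() and stack(df[vars]) instead of the cbind() call above, which is my own substitution to make the row alignment explicit. The names long, pooled and separate are illustrative.

```r
df <- data.frame(
  row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
  level  = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
  col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
  col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)
vars <- grep("^col", names(df), value = TRUE)

# Long format: one row per (observation, predictor) pair.
# stack() returns columns `values` and `ind`; level is repeated once per predictor.
long <- data.frame(level = rep(df$level, times = length(vars)), stack(df[vars]))

# One model with a separate intercept and slope per predictor
pooled <- lm(level ~ ind / values - 1, data = long)

# The five separate regressions, for comparison
separate <- lapply(vars, function(v) lm(reformulate(v, response = "level"), data = df))

pooled_slopes   <- unname(coef(pooled)[paste0("ind", vars, ":values")])
separate_slopes <- vapply(separate, function(m) coef(m)[[2]], numeric(1))

# Slopes agree between the two approaches
all.equal(pooled_slopes, separate_slopes)

# Residual standard deviations do not: one pooled value versus one per model
summary(pooled)$sigma
vapply(separate, function(m) summary(m)$sigma, numeric(1))
```

The pooled model has 40 residual degrees of freedom (50 stacked rows minus 10 coefficients), while each separate model has 8, which is why the standard errors and p-values differ even though the estimates match.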

Also you could do:

  lapply(df[-(1:2)], function(x)lm(df$level~x))

The problem with this is that every model labels its predictor x, so the variable names do not appear in the model output.

Another way:

lapply(names(df)[-(1:2)], function(x)lm(reformulate(x, 'level'), df))
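One thing to watch with this version: lapply() over a plain character vector returns an unnamed list, so the models can only be picked out by position. A small variation, sketched here with sapply(..., simplify = FALSE), keeps the column names on the list. The names vars and fits are illustrative.

```r
df <- data.frame(
  row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
  level  = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
  col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
  col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)
vars <- names(df)[-(1:2)]

# With a character vector as input and simplify = FALSE, sapply() returns a
# list whose names are the input strings (USE.NAMES defaults to TRUE)
fits <- sapply(
  vars,
  function(v) lm(reformulate(v, response = "level"), data = df),
  simplify = FALSE
)

names(fits)
coef(fits$col2)
```

setNames(lapply(vars, ...), vars) would give the same named list if you prefer to stay with lapply().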

CodePudding user response：

df <-
  data.frame(
    row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
    level = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
    col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
    col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
    col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
    col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
    col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
  )

library(tidyverse)

VARS <- grep("^col", names(df), value = TRUE) %>% 
  set_names()

map(VARS, ~lm(reformulate(.x, "level"), data = df)) %>% 
  map(summary)
#> $col1
#> 
#> Call:
#> lm(formula = reformulate(.x, "level"), data = df)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -4250  -1750   -125   2438   4000 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)     6250       1569   3.984  0.00404 **
#> col1           -1250       2025  -0.617  0.55425   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3137 on 8 degrees of freedom
#> Multiple R-squared:  0.04545,    Adjusted R-squared:  -0.07386 
#> F-statistic: 0.381 on 1 and 8 DF,  p-value: 0.5543
#> 
#> 
#> $col2
#> 
#> Call:
#> lm(formula = reformulate(.x, "level"), data = df)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -3500  -2375      0   2375   3500 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)     7000       1452   4.820  0.00132 **
#> col2           -2500       1875  -1.333  0.21914   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2905 on 8 degrees of freedom
#> Multiple R-squared:  0.1818, Adjusted R-squared:  0.07955 
#> F-statistic: 1.778 on 1 and 8 DF,  p-value: 0.2191
#> 
#> ...

Created on 2021-12-21 by the reprex package (v2.0.1)
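If the full summaries are more than you need, the slope row of each coefficient table can be collected into one matrix. This is only a sketch, and it uses base R rather than purrr so that it runs without extra packages. The name slope_rows is illustrative.

```r
df <- data.frame(
  row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
  level  = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
  col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
  col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)
vars <- grep("^col", names(df), value = TRUE)

# For each predictor, fit the model and keep only the predictor's row of the
# coefficient table (estimate, standard error, t value, p-value)
slope_rows <- do.call(rbind, lapply(vars, function(v) {
  fit <- lm(reformulate(v, response = "level"), data = df)
  coef(summary(fit))[v, , drop = FALSE]
}))

slope_rows
```

The result has one row per predictor, named after the column, which makes it easy to sort or filter by p-value.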