R loop over linear regression

Time:03-20

I have looked over the forum but couldn't find what I am looking for.


I want to run a simple linear regression several times, each time using a different column as the independent variable while the dependent variable stays the same. After running them, I want to extract the R squared from each regression. My first thought was to use a simple for loop, but I cannot make it work.

Assume I work with the following data:

    num value person1 person2 person3
0   1   229   29      81      0
1   2   203   17      75      0
2   3   244   62      0       55

and that I want to regress value on each of three variables in turn: person1, person2 and person3. Note that this is a minimal working example, but I hope to generalize the idea.

My initial attempt was:

column <- names(df)[-2]
for(i in 3:5){
  temp <- df[,c("value", column[i])]
  lm.test <- lm(value ~ ., data = temp)
  i + 1
}

However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3), which I think makes sense. But when I try to rewrite the assignment as lm.test[i] <- lm(value ~ ., data = temp), I get the following error:

debug at #3: temp <- df[,c("value", column[i])]

suggesting that there's something wrong with line 3?

If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.

CodePudding user response:

You can build the formula inside a loop and then pass it to lm(). For instance, to regress mpg on each of cyl, wt, and hp in mtcars, you can use the following:

vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
    lm_formula <- as.formula(paste0("mpg ~ ", col))
    lm(lm_formula, data = mtcars)
})

You can then iterate over lm_results to get the r.squared of each model:

lapply(lm_results, function(x) summary(x)$r.squared)
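If you prefer to keep the for loop from the question, a minimal sketch follows. It assumes the data frame is rebuilt from the values shown in the question, and the names predictors, models and r_squared are only illustrative. Each fitted model is stored with [[ ]] in a pre-allocated list, because an lm object is itself a list of components and does not fit into a single slot via lm.test[i] <- ...; reformulate() from base R is used here in place of paste0() plus as.formula().

```r
# Example data copied from the question
df <- data.frame(
  num     = 1:3,
  value   = c(229, 203, 244),
  person1 = c(29, 17, 62),
  person2 = c(81, 75, 0),
  person3 = c(0, 0, 55)
)

# Columns to use as the single predictor, one model per column
predictors <- c("person1", "person2", "person3")

# Pre-allocate a named list; each slot will hold one fitted model
models <- vector("list", length(predictors))
names(models) <- predictors

for (p in predictors) {
  # reformulate() builds value ~ person1, value ~ person2, ...
  models[[p]] <- lm(reformulate(p, response = "value"), data = df)
}

# Named numeric vector of R squared values, one per predictor
r_squared <- sapply(models, function(m) summary(m)$r.squared)
print(round(r_squared, 3))
```

The full summary of any single regression is still available afterwards, for example with summary(models[["person2"]]).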

CodePudding user response:

Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:

library(broom)
library(purrr)

lm.test <- map_dfr(
    set_names(names(df)[-2]),
    ~ glance(lm(
      as.formula(paste("value ~", .x)),
      data = df
     )),
    .id = "predictor"
)

Result:

# A tibble: 4 x 13
  predictor r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
  <chr>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
1 num           0.131       -0.739   27.4     0.150   0.765     1  -12.5  31.1
2 person1       0.836        0.672   11.9     5.10    0.265     1  -10.0  26.1
3 person2       0.542        0.0831  19.9     1.18    0.474     1  -11.6  29.2
4 person3       0.607        0.215   18.4     1.55    0.431     1  -11.3  28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
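As a rough sketch of that tidy() variant, assuming the broom and purrr packages (and dplyr, which map_dfr() relies on) are installed and the data frame is rebuilt from the question's values. The names predictors and coefs are illustrative, and the predictors are restricted to the three person columns here rather than every column except value.

```r
library(broom)
library(purrr)

# Example data copied from the question
df <- data.frame(
  num     = 1:3,
  value   = c(229, 203, 244),
  person1 = c(29, 17, 62),
  person2 = c(81, 75, 0),
  person3 = c(0, 0, 55)
)

predictors <- c("person1", "person2", "person3")

# One row per coefficient (intercept and slope) per model,
# labelled by the predictor the model was fitted with
coefs <- map_dfr(
  set_names(predictors),
  ~ tidy(lm(as.formula(paste("value ~", .x)), data = df)),
  .id = "predictor"
)
print(coefs)
```

Each model contributes two rows, so the result has six rows with columns such as term, estimate and p.value alongside the predictor label.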
