Home > Software design >  How to write a function that will run multiple regression models of the same type with different dep
How to write a function that will run multiple regression models of the same type with different dep

Time:09-21

I am trying to write a function that will run multiple regressions and then store the outputs in a vector. What I want is for the function to pick the dependent variables from a list that I will provide, and then run the regressions on the same right hand-side variables. Not sure how to go about doing this. Any hints will be appreciated.

my_data <- data.frame(x1=(1:10)   rnorm(10, 3, 1.5), x2=25/3   rnorm(10, 0, 1), 
                      dep.var1=seq(5, 28, 2.5), dep.var2=seq(100, -20, -12.5), 
                      dep.var3=seq(1, 25, 2.5))

## The following is a list that tells the function
dep.var <- list(dep.var1=my_data$dep.var1, dep.var2=my_data$dep.var2)
## which dependent variables to use from my_data

all_models <- function(dep.var){lm(dep.var ~ x1   x2, data=my_data)}

model <- sapply(dep.var, all_models) ## The "sapply" here tells the function to 
## take the dependent variables from the list dep.var.

I want the "model" list to have two objects: model1 with dep.var1 and model2 with dep.var2. Then as required, I will use summary(model#) to see the regression output.

I know that this in theory works when a vector is used (i.e., p):

p <- seq(0.25, 0.95, 0.05)

s <- function(p) {1 - pnorm(35, p*1*44, sqrt(44)*sqrt(p*(1 - p)))}

f <- sapply(p, s)

But I can't get the whole thing to work as required for my regression models. It works somewhat because you can run and check "model" and it will show you the two regression outputs - but it is horrible. And the "model" does not show the regression specification, i.e., dep.var1 ~ x1 x2.

CodePudding user response:

Consider reformulate to dynamically change model formulas using character values for lm calls:

# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")
 
# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
    fml <- reformulate(c("x1", "x2"), dep.var)
    lm(fml, data=data)
}
                     
# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)

# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...

From there, you can run further extractions or processes across model objects:

# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)

all_summaries$dep.var1
all_summaries$dep.var2
...

# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")

all_coefficients$dep.var1
all_coefficients$dep.var2
...

CodePudding user response:

You could sapply over the names of the dependent vars which you could nicely identify with grep. In lm use reformulate to build the formula.

sapply(grep('^dep', names(my_data), value=TRUE), \(x) 
       lm(reformulate(c('x1', 'x2'), x), my_data))
#               dep.var1           dep.var2           dep.var3          
# coefficients  numeric,3          numeric,3          numeric,3         
# residuals     numeric,10         numeric,10         numeric,10        
# effects       numeric,10         numeric,10         numeric,10        
# rank          3                  3                  3                 
# fitted.values numeric,10         numeric,10         numeric,10        
# assign        integer,3          integer,3          integer,3         
# qr            qr,5               qr,5               qr,5              
# df.residual   7                  7                  7                 
# xlevels       list,0             list,0             list,0            
# call          expression         expression         expression        
# terms         dep.var1 ~ x1   x2 dep.var2 ~ x1   x2 dep.var3 ~ x1   x2
# model         data.frame,3       data.frame,3       data.frame,3   

The dep.var* appear nicely in the result.

However, you probably want to use lapply and pipe it into setNames() to get the list elements named. Instead of grep you may of course define the dependent variables manually. To get a clean formular call stored, we use a trick once @g-grothendieck taught me using do.call.

dv <- as.list(grep('^dep', names(my_data), value=TRUE)[1:2])
res <- lapply(dv, \(x) {
  f <- reformulate(c('x1', 'x2'), x)
  do.call('lm', list(f, quote(my_data)))
}) |>
  setNames(dv)
res
# $dep.var1
# 
# Call:
#   lm(formula = dep.var1 ~ x1   x2, data = my_data)
# 
# Coefficients:
#   (Intercept)           x1           x2  
# -4.7450       2.3398       0.2747  
# 
# 
# $dep.var2
# 
# Call:
#   lm(formula = dep.var2 ~ x1   x2, data = my_data)
# 
# Coefficients:
#   (Intercept)           x1           x2  
# 148.725      -11.699       -1.373 

This allows you to get the summary() of the objects, which probably is what you want.

   summary(res$dep.var1)
# Call:
#   lm(formula = dep.var1 ~ x1   x2, data = my_data)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -2.8830 -1.8345 -0.2326  1.4335  4.2452 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -4.7450     7.2884  -0.651    0.536    
# x1            2.3398     0.2836   8.251 7.48e-05 ***
#   x2            0.2747     0.7526   0.365    0.726    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.55 on 7 degrees of freedom
# Multiple R-squared:  0.9117,  Adjusted R-squared:  0.8865 
# F-statistic: 36.14 on 2 and 7 DF,  p-value: 0.0002046

Finally you could wrap it in a function

calc_models <- \(dv) {
  lapply(dv, \(x) {
    f <- reformulate(c('x1', 'x2'), x)
    do.call('lm', list(f, quote(my_data)))
  }) |>
    setNames(dv)
}

calc_models(list('dep.var1', 'dep.var2'))

CodePudding user response:

Here is a way how you could iterate through your dataframe and apply the function to the group you define (here dep.var) and save the different models in a dataframe:

library(tidyverse)
library(broom)
my_data %>% 
    pivot_longer(
        starts_with("dep"),
        names_to = "group",
        values_to = "dep.var"
    ) %>% 
    mutate(group = as.factor(group)) %>% 
    group_by(group) %>% 
    group_split() %>% 
    map_dfr(.f = function(df) {
        lm(dep.var ~ x1   x2, data = df) %>%
             tidy() %>% # first output 
            #glance() %>% # second output
            add_column(group = unique(df$group), .before=1)
    })

dataframe output:

# A tibble: 9 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var1 (Intercept)   -5.29     11.6      -0.456 0.662   
2 dep.var1 x1             2.11      0.268     7.87  0.000101
3 dep.var1 x2             0.538     1.23      0.437 0.675   
4 dep.var2 (Intercept)  151.       57.9       2.61  0.0347  
5 dep.var2 x1           -10.6       1.34     -7.87  0.000101
6 dep.var2 x2            -2.69      6.15     -0.437 0.675   
7 dep.var3 (Intercept)   -9.29     11.6      -0.802 0.449   
8 dep.var3 x1             2.11      0.268     7.87  0.000101
9 dep.var3 x2             0.538     1.23      0.437 0.675 

list output:

[[1]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var1 (Intercept)   -5.29     11.6      -0.456 0.662   
2 dep.var1 x1             2.11      0.268     7.87  0.000101
3 dep.var1 x2             0.538     1.23      0.437 0.675   

[[2]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var2 (Intercept)   151.       57.9      2.61  0.0347  
2 dep.var2 x1            -10.6       1.34    -7.87  0.000101
3 dep.var2 x2             -2.69      6.15    -0.437 0.675   

[[3]]
# A tibble: 3 x 6
  group    term        estimate std.error statistic  p.value
  <fct>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 dep.var3 (Intercept)   -9.29     11.6      -0.802 0.449   
2 dep.var3 x1             2.11      0.268     7.87  0.000101
3 dep.var3 x2             0.538     1.23      0.437 0.675 

glance output:

  group    r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>        <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 dep.var1     0.927         0.906  2.32      44.3 0.000106     2  -20.8  49.7  50.9     37.8           7    10
2 dep.var2     0.927         0.906 11.6       44.3 0.000106     2  -36.9  81.9  83.1    944.            7    10
3 dep.var3     0.927         0.906  2.32      44.3 0.000106     2  -20.8  49.7  50.9     37.8           7    10
  • Related