I have the following dataset:
df <- data.frame(row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
                 level = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
                 col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
                 col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
                 col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
                 col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
                 col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1))
I would like to run a linear regression of the variable level on each of the other columns with the prefix col. I would like to use a for loop to do this instead of writing the following:
lm1<-lm(level~col1, data=df)
lm2<-lm(level~col2, data=df)
lm3<-lm(level~col3, data=df)
lm4<-lm(level~col4, data=df)
lm5<-lm(level~col5, data=df)
Any help would be much appreciated, thanks!
CodePudding user response:
First we need a way to create a formula given the variable we picked. One way to do this:
as.formula(paste0("level ~", var))
where var is a character string naming a column, such as "col1".
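As a quick check of that step, the sketch below builds the formula for a single column and compares it with base R's reformulate(), which does the same job without pasting strings. The value "col1" and the names f1 and f2 are only illustrative.

```r
# Column name held in a character string (example value)
var <- "col1"

# Build the formula by pasting text and converting it
f1 <- as.formula(paste0("level ~ ", var))

# reformulate() from the stats package builds the same formula from its parts
f2 <- reformulate(var, response = "level")

print(f1)
print(f2)
```

Either form can be passed as the first argument to lm().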
Now we just need to fit one model per variable. If you want to do this with a for loop, you can do something like this:
models = list()
# Create a vector of the explanatory variables
variables = setdiff(names(df), c("row_id", "level"))
for (var in variables) {
  # Fit level ~ <var> and store the model under the variable's name
  models[[var]] = lm(
    as.formula(paste0("level ~ ", var)),
    data = df
  )
}
models is a list containing each fitted model. For example, you can access the model for col3 with models$col3:
> summary(models$col3)
Call:
lm(formula = as.formula(paste0("level ~ ", var)), data = df)
Residuals:
Min 1Q Median 3Q Max
-3600 -1950 0 1200 5600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4400 1327 3.317 0.0106 *
col3 2200 1876 1.173 0.2747
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2966 on 8 degrees of freedom
Multiple R-squared: 0.1467, Adjusted R-squared: 0.04
F-statistic: 1.375 on 1 and 8 DF, p-value: 0.2747
There are lots of improvements you can make to this approach as the requirements become more complicated, but this is a good start.
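One possible next step, sketched here as an assumption about what might be useful rather than as part of the original answer, is to collect the intercept, slope, and slope p-value of every model in one place. The names coef_table and slope_p are made up for this example.

```r
df <- data.frame(row_id = 100:109,
                 level = seq(1000, 10000, by = 1000),
                 col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
                 col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
                 col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
                 col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
                 col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1))

# Same loop as above: one simple regression per explanatory variable
variables <- setdiff(names(df), c("row_id", "level"))
models <- list()
for (var in variables) {
  models[[var]] <- lm(as.formula(paste0("level ~ ", var)), data = df)
}

# One row per model, with the intercept and slope as columns
coef_table <- t(sapply(models, function(m) unname(coef(m))))
colnames(coef_table) <- c("intercept", "slope")

# p-value of the slope, taken from the second row of each summary table
slope_p <- sapply(models, function(m) summary(m)$coefficients[2, "Pr(>|t|)"])

print(coef_table)
print(slope_p)
```

Because the list is named, the rows of coef_table and the elements of slope_p carry the column names.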
CodePudding user response:
If you are only interested in the coefficients, you can get them from a single model by reshaping the data, with no for loop needed. The coefficients will match those of the separate regressions. The standard errors will not, because the combined model estimates one residual variance pooled across all the columns:
lm(level~ind/values-1,cbind(df[1:2], stack(df, -(1:2))))
Call:
lm(formula = level ~ ind/values - 1, data = cbind(df[1:2], stack(df,
-(1:2))))
Coefficients:
indcol1 indcol2 indcol3 indcol4
6250.0 7000.0 4400.0 5333.3
indcol5 indcol1:values indcol2:values indcol3:values
6750.0 -1250.0 -2500.0 2200.0
indcol4:values indcol5:values
238.1 -2083.3
The coefficients read as follows: indcol1 is the intercept for col1, while indcol1:values is its slope.
Compare these with the results you already have.
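To make that comparison concrete, here is a small sketch that fits the combined model and the separate col3 regression side by side. The object names long, pooled, and separate are only for illustration.

```r
df <- data.frame(row_id = 100:109,
                 level = seq(1000, 10000, by = 1000),
                 col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
                 col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
                 col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
                 col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
                 col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1))

# Long format: one row per (observation, column) pair, with columns values and ind
long <- cbind(df[1:2], stack(df, -(1:2)))

# One combined model with a separate intercept and slope per column
pooled <- lm(level ~ ind / values - 1, data = long)

# The ordinary single-column regression for comparison
separate <- lm(level ~ col3, data = df)

# Point estimates agree between the two fits
print(coef(pooled)[c("indcol3", "indcol3:values")])
print(coef(separate))

# Standard errors differ: the combined fit pools the residual variance
se_pooled <- summary(pooled)$coefficients["indcol3:values", "Std. Error"]
se_separate <- summary(separate)$coefficients["col3", "Std. Error"]
print(c(pooled = se_pooled, separate = se_separate))
```

If you need the standard errors or p-values of the individual regressions, fit the models separately instead.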
Also you could do:
lapply(df[-(1:2)], function(x)lm(df$level~x))
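The problem with this is that every model labels its predictor x, so the column name shows up only as the name of the list element and not in the model output.
Another way, which keeps the column name in each model's formula: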
lapply(names(df)[-(1:2)], function(x)lm(reformulate(x, 'level'), df))
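One thing to note about the last call is that lapply() over a plain character vector returns an unnamed list. A small variation, sketched below, names the input vector first so that each model can be looked up by column. The names vars and models are only illustrative.

```r
df <- data.frame(row_id = 100:109,
                 level = seq(1000, 10000, by = 1000),
                 col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
                 col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
                 col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
                 col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
                 col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1))

# Explanatory variables: everything except row_id and level
vars <- names(df)[-(1:2)]

# Naming the input vector makes lapply() return a named list
models <- lapply(setNames(vars, vars), function(x) {
  lm(reformulate(x, "level"), data = df)
})

print(names(models))
print(formula(models$col2))
print(coef(models$col2))
```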
CodePudding user response:
df <-
data.frame(
row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
level = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
col1 = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
col2 = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
col3 = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
col4 = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
col5 = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)
library(tidyverse)
VARS <- grep("^col", names(df), value = TRUE) %>%
  set_names()
map(VARS, ~ lm(reformulate(.x, "level"), data = df)) %>%
  map(summary)
#> $col1
#>
#> Call:
#> lm(formula = reformulate(.x, "level"), data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4250 -1750 -125 2438 4000
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6250 1569 3.984 0.00404 **
#> col1 -1250 2025 -0.617 0.55425
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3137 on 8 degrees of freedom
#> Multiple R-squared: 0.04545, Adjusted R-squared: -0.07386
#> F-statistic: 0.381 on 1 and 8 DF, p-value: 0.5543
#>
#>
#> $col2
#>
#> Call:
#> lm(formula = reformulate(.x, "level"), data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3500 -2375 0 2375 3500
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7000 1452 4.820 0.00132 **
#> col2 -2500 1875 -1.333 0.21914
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2905 on 8 degrees of freedom
#> Multiple R-squared: 0.1818, Adjusted R-squared: 0.07955
#> F-statistic: 1.778 on 1 and 8 DF, p-value: 0.2191
#>
#> ...
Created on 2021-12-21 by the reprex package (v2.0.1)