I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).
Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):
library(dplyr)
library(tidyr)
library(purrr)
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
map_df(glance, .id='variable') %>%
select(variable, r.squared, p.value)
# A tibble: 10 x 3
variable r.squared p.value
<chr> <dbl> <dbl>
1 cyl 0.726 6.11e-10
2 disp 0.718 9.38e-10
3 hp 0.602 1.79e- 7
4 drat 0.464 1.78e- 5
5 wt 0.753 1.29e-10
6 qsec 0.175 1.71e- 2
7 vs 0.441 3.42e- 5
8 am 0.360 2.85e- 4
9 gear 0.231 5.40e- 3
10 carb 0.304 1.08e- 3
And running a linear model over grouped variables is also easy using map:
mtcars %>%
split(.$cyl) %>% #split by grouping variable
map(~ lm(mpg ~ wt, data = .)) %>%
map_df(broom::glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
# A tibble: 3 x 3
cyl r.squared p.value
<chr> <dbl> <dbl>
1 4 0.509 0.0137
2 6 0.465 0.0918
3 8 0.423 0.0118
So I can run by variable, or by group. However, I can't figure out how to combine these two (grouping everything by cyl, then running lm(mpg ~ each other variable, separately). I'd hoped to do something like this:
mtcars %>%
select(-mpg) %>% #exclude outcome, leave predictors
split(.$cyl) %>% # group by grouping variable
map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
map_df(glance, .id='cyl') %>%
select(cyl, variable, r.squared, p.value)
and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).
But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!
CodePudding user response:
IIUC, you can use group_by
and group_modify
, with a map
inside that iterates over predictors.
If you can isolate your predictor variables in advance, it'll make it easier, as with ivs
in this solution.
library(tidyverse)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs
mtcars %>%
group_by(cyl) %>%
group_modify(function(data, key) {
map_df(ivs, function(iv) {
frml <- as.formula(paste("mpg", "~", iv))
lm(frml, data = data) %>% broom::glance()
}, .id = "iv")
}) %>%
select(cyl, iv, r.squared, p.value)
# A tibble: 27 × 4
# Groups: cyl [3]
cyl iv r.squared p.value
<dbl> <chr> <dbl> <dbl>
1 4 disp 0.648 0.00278
2 4 hp 0.274 0.0984
3 4 drat 0.180 0.193
4 4 wt 0.509 0.0137
5 4 qsec 0.0557 0.485
6 4 vs 0.00238 0.887
7 4 am 0.287 0.0892
8 4 gear 0.115 0.308
9 4 carb 0.0378 0.567
10 6 disp 0.0106 0.826
11 6 hp 0.0161 0.786
# ...