Extracting coefficients from 100s of regressions using one split regression rather than a loop?-CodePudding

I need to run over 600 regressions, each on a different MECE group of the data (group takes values {1,2,...,623}). From each regression, I need to store coefficient estimates for all independent variables. I was able to do this by looping through regressions (see below); but, I'm finding this quite slow and I believe there is a better way:

# loop prep
formula <- "dv ~ iv_1   iv_2   iv_3 | fe"
ols_stored_coef <- matrix(0, 623, 3)
ols_stored_coef <- as.data.frame(ols_stored_coef )

# loop
for(i in 1:623) {
 #run regression:
 ols <- feols(as.formula(formula), subset(df, group==i))
 # generate coefficients:
 ols_coef <- summary(ols)$coefficients
 ols_coef <- data.frame(as.list(ols_coef))
 # store coefficients:
 ols_stored_coef[i,1] = ols_coef[1,1]
 ols_stored_coef[i,2] = ols_coef[1,2]
 ols_stored_coef[i,3] = ols_coef[1,3]
}

This works, but it takes about 10 minutes to run (there are around 6 million observations and 623 MECE groups). However, I know that the following command estimates all 623 regressions in about 1 minute:

ols_split <- feols(as.formula(formula), df, split=~group)

Regression data is stored all together in a single "List of 623." I am able to extract coefficients per group via the following, where X is the group value.

ols_split $`sample.var: store; sample: X`$coefficients

In an ideal world, I could run this split feols(), and then store the coefficients via looping:

for(i in 1:623) {
  ols_coef <- ols_split $`sample.var: store; sample: i`$coefficients
  ols_coef <- data.frame(as.list(ols_coef))
  # store coefficients:
  ols_stored_coef[i,1] = ols_coef[1,1]
  ols_stored_coef[i,2] = ols_coef[1,2]
  ols_stored_coef[i,3] = ols_coef[1,3]
}

However, because i is in quotations `` I believe it is being read as text and thus not working.

Is there any way I can use the ols_split List of 623 regression results to extract coefficients?

CodePudding user response：

The fixest package that you've used has some in build functions to support this. Here's my example based from yours:

df <- tibble(
  dv = rnorm(1000),
  iv_1 = rnorm(1000),
  iv_2 = rnorm(1000),
  iv_3 = rnorm(1000),
  fe   = 1,
  group = sample(LETTERS, 1000, replace  = TRUE)
)

formula <- "dv ~ iv_1   iv_2   iv_3 | fe"
ols_stored_coef <- matrix(0, 623, 3)
ols_stored_coef <- as.data.frame(ols_stored_coef )
ols_split <- fixest::feols(as.formula(formula), df, split=~group)

out <- fixest::coeftable(ols_split) 
head(out)

  id sample.var sample coefficient    Estimate Std. Error    t value  Pr(>|t|)
1  1      group      A        iv_1 -0.04816492  0.2019670 -0.2384791 0.8133102
2  1      group      A        iv_2 -0.18081949  0.1982410 -0.9121193 0.3697786
3  1      group      A        iv_3  0.04826683  0.1961902  0.2460206 0.8075269
4  2      group      B        iv_1 -0.15561382  0.1824392 -0.8529625 0.3993197
5  2      group      B        iv_2  0.06064802  0.2348541  0.2582370 0.7976946
6  2      group      B        iv_3 -0.07948869  0.1981408 -0.4011728 0.6906643

Of course, if this format isn't what you want and you really do want a matrix that's trivial with some wrangling from here. i.e.

m <- matrix(out$Estimate, ncol = length(unique(out$coefficient)), byrow = TRUE)
colnames(m) <- unique(out$coefficient)
rownames(m) <- unique(out$sample)

head(m)

CodePudding user response：

I would suggest another way, using tidyverse tools. I'm using the gapminder dataset for reproducibility. If you have data that you can provide as an example, it can be applied to fit this as well:

library(gapminder)
library(purrr)
library(broom)
library(dplyr)
library(tidyr)

gapminder |> 
  group_by(country) |> 
  nest() |> 
  mutate(fit = map(data, ~ lm(lifeExp ~ gdpPercap, data = .)),
         coefs = map(fit, tidy)) |> 
  unnest(coefs)

# A tibble: 284 × 8
# Groups:   country [142]
   country     data              fit    term        estimate std.error statistic      p.value
   <fct>       <list>            <list> <chr>          <dbl>     <dbl>     <dbl>        <dbl>
 1 Afghanistan <tibble [12 × 5]> <lm>   (Intercept) 39.3     12.0          3.26  0.00857     
 2 Afghanistan <tibble [12 × 5]> <lm>   gdpPercap   -0.00224  0.0149      -0.151 0.883       
 3 Albania     <tibble [12 × 5]> <lm>   (Intercept) 54.0      3.16        17.1   0.0000000101
 4 Albania     <tibble [12 × 5]> <lm>   gdpPercap    0.00444  0.000917     4.84  0.000682    
 5 Algeria     <tibble [12 × 5]> <lm>   (Intercept) 27.4      4.90         5.60  0.000226    
 6 Algeria     <tibble [12 × 5]> <lm>   gdpPercap    0.00714  0.00106      6.71  0.0000533   
 7 Angola      <tibble [12 × 5]> <lm>   (Intercept) 41.6      3.91        10.6   0.000000899 
 8 Angola      <tibble [12 × 5]> <lm>   gdpPercap   -0.00103  0.00104     -0.998 0.342       
 9 Argentina   <tibble [12 × 5]> <lm>   (Intercept) 52.3      3.60        14.5   0.0000000479
10 Argentina   <tibble [12 × 5]> <lm>   gdpPercap    0.00187  0.000395     4.74  0.000797    
# … with 274 more rows
# ℹ Use `print(n = ...)` to see more rows

By using nest() and map() (base version is lapply()), you can iterate through each variable you are grouping by to fit the model and extract the coefficients and other information, using broom::tidy().