Run multiple similar models on different outcomes

I would like to run the same model specification on different outcomes in a neat way, instead of running a model for each outcome. I would also like to iteratively hold out one observation at a time (e.g. a county) from a model to check if single observations drive the results. I have tried creating a for loop but without luck so far.

library(lfe)

## Create long format dataset. Unit of analysis is county-year,
## i.e. one observation equals a county in a given year.
## The independent variable, x, is a dummy (0, 1).

year <- c(2007, 2007, 2007, 2007, 2007, 2009, 2009, 2009, 2009, 2009)
county <- c("county1", "county2", "county3", "county4", "county5", 
           "county1", "county2", "county3", "county4", "county5")
x <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
y1 <- c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3)
y2 <- c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9)
y3 <- c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3)

D <- data.frame(year, county, x, y1, y2, y3)

# I have multiple dependent variables: y1, y2, y3, y4 and so on. I only have one
# independent variable, x. I want to estimate the model specification below for each
# dependent variable in a smart way, without having to write it out each time.

m1 <- felm(y1 ~ x                            # outcome regressed on treatment
           | factor(county) + factor(year)   # county and time fixed effects
           | 0                               # no IVs
           | county,                         # SE clustered on the county
           data = D)

# Furthermore, I'd like to iteratively hold out/remove one county or year while estimating a model, to check if they are driving the results

CodePudding user response：

Here's a function that should do it:

library(lfe)
#> Loading required package: Matrix

## Create long format dataset. Unit of analysis is county-year,
## i.e. one observation equals a county in a given year.
## The independent variable, x, is a dummy (0, 1).

year <- c(2007, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009)
county <- c("county1", "county2", "county3", "county4", "county5", 
            "county1", "county2", "county3", "county4", "county5",
            "county1", "county2", "county3", "county4", "county5")
x <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0)
y1 <- c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3, 4,7,2,3,5)
y2 <- c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9, 5,2,4,6,2)
y3 <- c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3, 8,7,3,4,6)

D <- data.frame(year, county, x, y1, y2, y3)

# I have multiple dependent variables: y1, y2, y3, y4 and so on. I only have one
# independent variable, x. I want to estimate the model specification below for each
# dependent variable in a smart way, without having to write it out each time.

m1 <- felm(y1 ~ x                            # outcome regressed on treatment
           | factor(county) + factor(year)   # county and time fixed effects
           | 0                               # no IVs
           | county,                         # SE clustered on the county
           data = D)

# Refit `model` once per unique value of `remove`, dropping that value each time,
# and collect the coefficients from every refit.
jfun <- function(model, data, remove = NULL) {
  if (is.null(remove)) {
    stop("Must choose a variable whose values will be jackknifed out.\n")
  }
  dat <- get_all_vars(model, data)  # keep only the variables used in the model
  if (!(remove %in% names(dat))) {
    stop("The remove variable must be in the model.\n")
  }
  obs <- unique(dat[[remove]])
  res <- NULL
  for (i in seq_along(obs)) {
    subd <- subset(dat, dat[[remove]] != obs[i])  # drop one value of `remove`
    mod <- update(model, data = subd)             # refit on the reduced data
    res <- rbind(res, coef(mod))
  }
  cbind(data.frame(obs_removed = obs), res)
}

jfun(m1, D, "county")
#>   obs_removed         x
#> 1     county1 -1.050000
#> 2     county2 -1.250000
#> 3     county3 -1.163333
#> 4     county4 -3.991667
#> 5     county5 -0.562500
jfun(m1, D, "year")
#>   obs_removed          x
#> 1        2007 -3.4857143
#> 2        2008 -3.5000000
#> 3        2009  0.5083333

Created on 2022-03-06 by the reprex package (v2.0.1)

The function jfun() takes a fitted model object (the one you want to jackknife), the dataset used to fit it, and a string naming the variable whose values you would like to hold out. It finds all unique values of that variable and then, in a loop, removes each one in turn, refits the model on the remaining data, and saves the coefficients.

CodePudding user response：

## 1. fitting models on different outcomes.

# My solution redefines the data frame passed to "data" at each iteration. The trick is to
# select only the desired columns.
outcomes = c("y1", "y2", "y3")
model.list = vector(mode = "list", length = length(outcomes)) # Pre-allocate a list with one slot per outcome.
j = 1 # Counter.
for (i in outcomes)
{
  temp.dta = data.frame(y = D[, i], D[, !(colnames(D) %in% outcomes)]) # Select the current outcome plus the non-outcome columns.
  model.list[[j]] <- felm(y ~ x | factor(county) + factor(year) | 0 | county, data = temp.dta) # Store the fit in the j-th position.
  j = j + 1 # Increase counter.
}

summary(model.list[[1]]) # Model fitted on y1.

## 2. fitting same model n times, with i-th observations removed, where i = 1, ..., n.

# With similar reasoning (i.e., redefining the data frame), we can omit one row at each iteration.
# For simplicity, focus on y1. 
model.list2 = vector(mode = "list", length = nrow(D)) # Pre-allocate a list with one slot per row of the data.
for (h in seq_len(nrow(D)))
{
  model.list2[[h]] <- felm(y1 ~ x | factor(county) + factor(year) | 0 | county, data = D[-h, ]) # Omit the h-th row.
}

summary(model.list2[[1]]) # Model with first row omitted.

## 3. combining both ideas -> just combine both solutions (nested loops).

It may not be the most elegant solution, but it works, and it is easy to understand and implement.

Regarding the first question, we can use a for loop to redefine the data frame used at each iteration. The idea is to select only the columns needed for the fit: the covariates (which stay constant across iterations) and the desired outcome. Notice that I always name the outcome column y, so the formula never has to change. With the data frame so defined (stored in temp.dta), we can fit all the models by setting data = temp.dta within felm(). Results are stored in the list model.list, which must be defined before the loop.
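As an alternative to renaming the outcome column, the formula itself can be built per outcome. The following is only a sketch under a few assumptions: it reuses the three-year example data from the first answer, and the names outcomes and fits are introduced here for illustration.

```r
library(lfe)

# Example data copied from the first answer (three years, five counties).
D <- data.frame(
  year   = rep(c(2007, 2008, 2009), each = 5),
  county = rep(paste0("county", 1:5), times = 3),
  x  = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0),
  y1 = c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3, 4, 7, 2, 3, 5),
  y2 = c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9, 5, 2, 4, 6, 2),
  y3 = c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3, 8, 7, 3, 4, 6)
)

outcomes <- c("y1", "y2", "y3")  # illustrative name, not from the original posts

# Paste the outcome name into the left-hand side of the same specification,
# then fit one model per outcome.
fits <- lapply(outcomes, function(out) {
  f <- as.formula(paste(out, "~ x | factor(county) + factor(year) | 0 | county"))
  felm(f, data = D)
})
names(fits) <- outcomes

# Coefficient on x for each outcome.
sapply(fits, function(m) coef(m)[["x"]])
```

Because the list is named by outcome, summary(fits[["y2"]]) retrieves the model for y2 directly, and adding a fourth outcome only means extending outcomes.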

The same trick can be used to fit the model several times while dropping one observation at a time. Now, rather than selecting columns, we select rows. In this case we do not need to redefine the data frame, as we can subset the sample directly in the data argument.

Notice that for the second solution I focused on y1 for simplicity. If you want to fit the model for all three outcomes, and for each of them repeat the operation while dropping one observation at a time, just combine the solutions by implementing two nested loops. Sort of "the proof is left as an exercise for the reader".
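For anyone who would rather skip the exercise, here is a minimal sketch of the nested loops. It assumes the three-year example data from the first answer and holds out one county at a time (as the question asks) rather than one row; the names outcomes, counties, held_out and res are illustrative only.

```r
library(lfe)

# Example data copied from the first answer (three years, five counties).
D <- data.frame(
  year   = rep(c(2007, 2008, 2009), each = 5),
  county = rep(paste0("county", 1:5), times = 3),
  x  = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0),
  y1 = c(2.5, 8, 10, 7, 2, 3, 13, 17, 4.5, 1.3, 4, 7, 2, 3, 5),
  y2 = c(6.5, 2, 3, 18, 2, 14, 7.6, 2.4, 8.2, 4.9, 5, 2, 4, 6, 2),
  y3 = c(5.2, 2, 5, 7.5, 5, 9, 3, 1.7, 2.5, 5.3, 8, 7, 3, 4, 6)
)

outcomes <- c("y1", "y2", "y3")
counties <- unique(D$county)

# res[i, j] holds the coefficient on x for outcome j with county i held out.
res <- matrix(NA_real_, nrow = length(counties), ncol = length(outcomes),
              dimnames = list(dropped = counties, outcome = outcomes))

for (out in outcomes) {
  # Same specification for every outcome; only the left-hand side changes.
  f <- as.formula(paste(out, "~ x | factor(county) + factor(year) | 0 | county"))
  for (cty in counties) {
    held_out <- D[D$county != cty, ]  # drop every row of one county
    res[cty, out] <- coef(felm(f, data = held_out))[["x"]]
  }
}

res
```

Each column of res can then be scanned for a county whose removal changes the estimate noticeably. Swapping D$county for D$year in the inner loop gives the leave-one-year-out version.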