Confusion about the code case in MatchIt/vignettes/estimating-effects-CodePudding

About the example given in the document: https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html The code:

gen_X <- function(n) {
  X <- matrix(rnorm(9 * n), nrow = n, ncol = 9)
  X[,5] <- as.numeric(X[,5] < .5)
  X
}

gen_A <- function(X) {
  LP_A <- - 1.2   log(2)*X[,1] - log(1.5)*X[,2]   log(2)*X[,4] - log(2.4)*X[,5]   log(2)*X[,7] - log(1.5)*X[,8]
  P_A <- plogis(LP_A)
  rbinom(nrow(X), 1, P_A)
}


gen_Y_C <- function(A, X) {
  2*A   2*X[,1]   2*X[,2]   2*X[,3]   1*X[,4]   2*X[,5]   1*X[,6]   rnorm(length(A), 0, 5)
}

gen_Y_B <- function(A, X) {
  LP_B <- -2   log(2.4)*A   log(2)*X[,1]   log(2)*X[,2]   log(2)*X[,3]   log(1.5)*X[,4]   log(2.4)*X[,5]   log(1.5)*X[,6]
  P_B <- plogis(LP_B)
  rbinom(length(A), 1, P_B)
}

gen_Y_S <- function(A, X) {
  LP_S <- -2   log(2.4)*A   log(2)*X[,1]   log(2)*X[,2]   log(2)*X[,3]   log(1.5)*X[,4]   log(2.4)*X[,5]   log(1.5)*X[,6]
  sqrt(-log(runif(length(A)))*2e4*exp(-LP_S))
}


set.seed(19599)

n <- 2000
X <- gen_X(n)
A <- gen_A(X)

Y_C <- gen_Y_C(A, X)
Y_B <- gen_Y_B(A, X)
Y_S <- gen_Y_S(A, X)

d <- data.frame(A, X, Y_C, Y_B, Y_S)
#=================================================================
pair_ids <- levels(md$subclass)

est_fun <- function(pairs, i) {
  
#Compute number of times each pair is present
numreps <- table(pairs[i])
  
#For each pair p, copy corresponding md row indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
                       function(p) rep(which(md$subclass == p), 
                                              numreps[p])))
 
  
#Subset md with block bootstrapped ids
md_boot <- md[ids,]
  
#Effect estimation
fit_boot <- lm(Y_C ~ A, data = md_boot, weights = weights)
  
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}

boot_est <- boot(pair_ids, est_fun, R = 499)
boot_est
boot.ci(boot_est, type = "bca")

Is there a mistake in the est_fun?

Running pair_ids <- levels(md$subclass) only generates c(1,2,3,4,5,6,7……）(each observation in pair_ids has the size of 1), because it just outputs the level.

So running numreps <- table(pairs[i]),if i=3 , it outputs [1] "3"; numreps is not "the number of times each pair is present".

Running numreps[p](here numreps[3]), it outputs NULL and it makes rep(which(md$subclass == p), numreps[p]) seem meaningless (the times of replication is always 1).

Although this case code outputs right results of the effect estimates, maybe is there a little bit of invalidity in this function we set?

CodePudding user response：

Read the documentation for boot::boot(). i is not a single number, it is a vector of length length(pair_ids) that is formed by running

i <- sample(seq_along(pair_ids), length(pair_ids), replace = TRUE)

many times. Because when bootstrapping you are sampling with replacement, it is possible (and almost guaranteed) for i to have duplicate entries, in which case table(pairs[i]) will not yield all 1s. What this function does is sample the pair IDs with replacement, create a new dataset with the units present in the resampled pair IDs, and then estimate the treatment effect in each new dataset. That is what a cluster bootstrap is supposed to do; there is nothing invalid about this implementation.

Try running this yourself:

i <- sample(seq_along(pair_ids), length(pair_ids), replace = TRUE)
table(pair_ids[i])
table(table(pair_ids[i]))

You will see that many pair IDs show up more than once; the third line shows how many times each number of duplicates is present.