About the example given in the document: https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html The code:
gen_X <- function(n) {
X <- matrix(rnorm(9 * n), nrow = n, ncol = 9)
X[,5] <- as.numeric(X[,5] < .5)
X
}
gen_A <- function(X) {
LP_A <- - 1.2 log(2)*X[,1] - log(1.5)*X[,2] log(2)*X[,4] - log(2.4)*X[,5] log(2)*X[,7] - log(1.5)*X[,8]
P_A <- plogis(LP_A)
rbinom(nrow(X), 1, P_A)
}
gen_Y_C <- function(A, X) {
2*A 2*X[,1] 2*X[,2] 2*X[,3] 1*X[,4] 2*X[,5] 1*X[,6] rnorm(length(A), 0, 5)
}
gen_Y_B <- function(A, X) {
LP_B <- -2 log(2.4)*A log(2)*X[,1] log(2)*X[,2] log(2)*X[,3] log(1.5)*X[,4] log(2.4)*X[,5] log(1.5)*X[,6]
P_B <- plogis(LP_B)
rbinom(length(A), 1, P_B)
}
gen_Y_S <- function(A, X) {
LP_S <- -2 log(2.4)*A log(2)*X[,1] log(2)*X[,2] log(2)*X[,3] log(1.5)*X[,4] log(2.4)*X[,5] log(1.5)*X[,6]
sqrt(-log(runif(length(A)))*2e4*exp(-LP_S))
}
set.seed(19599)
n <- 2000
X <- gen_X(n)
A <- gen_A(X)
Y_C <- gen_Y_C(A, X)
Y_B <- gen_Y_B(A, X)
Y_S <- gen_Y_S(A, X)
d <- data.frame(A, X, Y_C, Y_B, Y_S)
#=================================================================
pair_ids <- levels(md$subclass)
est_fun <- function(pairs, i) {
#Compute number of times each pair is present
numreps <- table(pairs[i])
#For each pair p, copy corresponding md row indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
function(p) rep(which(md$subclass == p),
numreps[p])))
#Subset md with block bootstrapped ids
md_boot <- md[ids,]
#Effect estimation
fit_boot <- lm(Y_C ~ A, data = md_boot, weights = weights)
#Return the coefficient on treatment
return(coef(fit_boot)["A"])
}
boot_est <- boot(pair_ids, est_fun, R = 499)
boot_est
boot.ci(boot_est, type = "bca")
Is there a mistake in the est_fun
?
Running pair_ids <- levels(md$subclass)
only generates c(1,2,3,4,5,6,7……)
(each observation in pair_ids has the size of 1), because it just outputs the level.
So running numreps <- table(pairs[i])
,if i=3
, it outputs [1] "3"
; numreps
is not "the number of times each pair is present".
Running numreps[p]
(here numreps[3]
), it outputs NULL
and it makes rep(which(md$subclass == p), numreps[p])
seem meaningless (the times of replication is always 1).
Although this case code outputs right results of the effect estimates, maybe is there a little bit of invalidity in this function we set?
CodePudding user response:
Read the documentation for boot::boot()
. i
is not a single number, it is a vector of length length(pair_ids)
that is formed by running
i <- sample(seq_along(pair_ids), length(pair_ids), replace = TRUE)
many times. Because when bootstrapping you are sampling with replacement, it is possible (and almost guaranteed) for i
to have duplicate entries, in which case table(pairs[i])
will not yield all 1s. What this function does is sample the pair IDs with replacement, create a new dataset with the units present in the resampled pair IDs, and then estimate the treatment effect in each new dataset. That is what a cluster bootstrap is supposed to do; there is nothing invalid about this implementation.
Try running this yourself:
i <- sample(seq_along(pair_ids), length(pair_ids), replace = TRUE)
table(pair_ids[i])
table(table(pair_ids[i]))
You will see that many pair IDs show up more than once; the third line shows how many times each number of duplicates is present.