I've been exploring the MatchIt package in R and am wondering how the eCDF Mean it reports is calculated. I used the lalonde data from the package and ran matchit():
library("MatchIt")
data("lalonde")
m.out1 <- matchit(treat ~ age + educ + race + married +
                  nodegree + re74 + re75, data = lalonde,
                  method = "nearest", distance = "glm")
The summary output of matchit is:
Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
    re74 + re75, data = lalonde, method = "nearest", distance = "glm")
Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance          0.5774        0.1822          1.7941     0.9211    0.3774   0.6444
age              25.8162       28.0303         -0.3094     0.4400    0.0813   0.1577
educ             10.3459       10.2354          0.0550     0.4959    0.0347   0.1114
raceblack         0.8432        0.2028          1.7615          .    0.6404   0.6404
racehispan        0.0595        0.1422         -0.3498          .    0.0827   0.0827
racewhite         0.0973        0.6550         -1.8819          .    0.5577   0.5577
married           0.1892        0.5128         -0.8263          .    0.3236   0.3236
nodegree          0.7081        0.5967          0.2450          .    0.1114   0.1114
re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248   0.4470
re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342   0.2876
According to vignette("assessing-balance"), eCDF Mean is the average distance between the eCDFs of the covariate across the groups. So I've been trying to calculate the eCDF Mean manually, for example for the age covariate.
First, I split the data in two: "people1" for the treated units and "people2" for the untreated units. Then I create the eCDF of age for the treated (A) and the untreated (B):
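# AGE
# Assumed split (not shown in the original post): treated vs. untreated rows
people1 <- lalonde[lalonde$treat == 1, ]
people2 <- lalonde[lalonde$treat == 0, ]
people=na.omit(people1$age)
age1=ecdf(as.numeric(people))    # eCDF of age, treated
people2=na.omit(people2$age)
age2=ecdf(as.numeric(people2))   # eCDF of age, untreated
# Pull the step locations (x) and cumulative proportions (y) out of each eCDF
A=as.data.frame(cbind(as.list(environment(age1))$x, as.list(environment(age1))$y));A
B=as.data.frame(cbind(as.list(environment(age2))$x, as.list(environment(age2))$y));B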
The data frame C below holds the eCDFs of the treated (A) and untreated (B) groups side by side.
C=merge(A,B,by="V1",all=TRUE);C
C=na.omit(C) # drop the rows with NA values
D=abs(C$V2.x-C$V2.y);summary(D)
D is the absolute difference between the eCDFs of the treated (treat=1) and untreated (treat=0) groups, but its summary is:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01850 0.06193 0.08809 0.09113 0.11888 0.15773
As you can see, the Max of the eCDF difference matches the MatchIt output, but the Mean does not. Can anybody spot the problem, or explain how the eCDF Mean is calculated? Thank you!
CodePudding user response:
This is some of the most convoluted code I've ever seen. I'll simplify things and show you how the statistic is calculated. That said, this statistic has not been well studied and is part of the output primarily for historical reasons. Use eCDF Max (the Kolmogorov-Smirnov statistic) instead.
Step 1: get the eCDFs (which are functions, not vectors) from the treated and control units
ecdf1 <- ecdf(lalonde$age[lalonde$treat == 1])
ecdf0 <- ecdf(lalonde$age[lalonde$treat == 0])
What these functions do is take a value of the variable (age) and return the cumulative proportion of observations at or below that value.
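To make the "functions, not vectors" point concrete, here is a minimal sketch on a made-up vector (the values are invented for illustration and have nothing to do with lalonde). It only uses ecdf() and knots() from base R's stats package.

```r
# Toy data, invented for illustration: four values with one duplicate
x <- c(1, 2, 2, 4)
F_x <- ecdf(x)      # F_x is a function, not a vector

F_x(2)              # proportion of values <= 2
F_x(3)              # same as F_x(2): no observation lies in (2, 3]
F_x(0)              # below the minimum, so 0
knots(F_x)          # the unique values at which the function steps up
```

The duplicate 2 does not create a second step; it just makes the step at 2 taller, which is why evaluating at unique values is enough.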
Step 2: evaluate the eCDFs at each unique value of age
The reason we have to use unique values is that the eCDF already accounts for the duplicate values by creating a step in the function.
cum.dens1 <- ecdf1(unique(lalonde$age))
cum.dens0 <- ecdf0(unique(lalonde$age))
Step 3: compute the mean and maximum values of the absolute difference
ecdf.diffs <- abs(cum.dens1 - cum.dens0)
mean(ecdf.diffs)
# [1] 0.08133907
max(ecdf.diffs)
# [1] 0.157727
We can see we get the right answers.
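As for where the 0.0911 in the question likely comes from: merge(..., all = TRUE) followed by na.omit() keeps only the values that occur in both groups, so the mean is taken over fewer points than the calculation above uses. A small sketch on made-up data (the vectors treated and control are invented for illustration) shows the effect:

```r
# Toy data, invented for illustration (not the lalonde values)
treated <- c(20, 22, 22, 25)
control <- c(20, 21, 25, 30, 30)

ecdf1 <- ecdf(treated)
ecdf0 <- ecdf(control)

# Every unique value seen in either group
all_vals <- sort(unique(c(treated, control)))
# Only values seen in both groups: what survives merge() + na.omit()
common_vals <- intersect(treated, control)

diff_all    <- abs(ecdf1(all_vals) - ecdf0(all_vals))
diff_common <- abs(ecdf1(common_vals) - ecdf0(common_vals))

# The means differ because diff_common skips some evaluation points
c(mean_all = mean(diff_all), mean_common = mean(diff_common))
# The maxima agree here because the largest gap sits at a shared value
c(max_all = max(diff_all), max_common = max(diff_common))
```

The same pattern would explain why the question's maximum matched the MatchIt output while its mean did not.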
The actual code MatchIt uses is a bit less transparent but it makes it run much faster.
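For a rough idea of what a faster version can look like, here is a sketch that avoids building ecdf objects by sorting once and taking a cumulative sum of signed weights. It is an illustration only and is not claimed to be MatchIt's actual implementation; the function name ecdf_diff_fast and the toy data are made up for this example.

```r
# Hypothetical helper: eCDF mean/max difference via a single sort + cumsum
ecdf_diff_fast <- function(x, treat) {
  # +1/n1 for each treated unit, -1/n0 for each control unit
  w <- ifelse(treat == 1, 1 / sum(treat == 1), -1 / sum(treat == 0))
  ord <- order(x)
  x_ord <- x[ord]
  # Running total = (treated eCDF) - (control eCDF) as we move up through x
  run <- cumsum(w[ord])
  # Keep the running total only at the last occurrence of each unique value
  last_of_value <- c(diff(x_ord) != 0, TRUE)
  ediff <- abs(run[last_of_value])
  c(mean = mean(ediff), max = max(ediff))
}

# Toy data, invented for illustration
x     <- c(20, 22, 22, 25, 20, 21, 25, 30, 30)
treat <- c( 1,  1,  1,  1,  0,  0,  0,  0,  0)
res <- ecdf_diff_fast(x, treat)

# Cross-check against the ecdf()-based calculation
u <- unique(x)
slow <- abs(ecdf(x[treat == 1])(u) - ecdf(x[treat == 0])(u))
all.equal(unname(res), c(mean(slow), max(slow)))
```

Applied to lalonde, this would be called as ecdf_diff_fast(lalonde$age, lalonde$treat).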
CodePudding user response:
This is not an answer to the question but it's too big to be a comment.
The problem in the question seems to come from the way the MatchIt package computes the averages: they appear to be weighted averages.
The code below has the same output as the question's code but I post it here because I think it's more idiomatic. It's definitely simpler.
library("MatchIt")
data("lalonde")
m.out1 <- matchit(treat ~ age + educ + race + married +
                  nodegree + re74 + re75, data = lalonde,
                  method = "nearest", distance = "glm")
summary(m.out1)
sp_lalonde <- split(lalonde, lalonde$treat)
tmp <- lapply(sp_lalonde, \(x){
e <- ecdf(x$age)
out <- as.list(environment(e))[c("x", "y")]
as.data.frame(out)
})
C <- Reduce(function(x, y) merge(x, y, by = "x", all = TRUE), tmp) |> na.omit()
D <- abs(C[[2]] - C[[3]])
summary(D)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#0.01850 0.06193 0.08809 0.09113 0.11888 0.15773
mean(apply(C[-1], 1, dist))
#[1] 0.09112509