How to Calculate eCDF Mean in MatchIt() R

Time:11-07

I've been exploring the MatchIt package in R and am wondering how the eCDF Mean is calculated. I used the lalonde data from this package and ran matchit():

library("MatchIt")
data("lalonde")
m.out1 <- matchit(treat ~ age + educ + race + married +
                   nodegree + re74 + re75, data = lalonde,
                 method = "nearest", distance = "glm")

The summary output of matchit() is:

Call:
matchit(formula = treat ~ age + educ + race + married + nodegree + 
    re74 + re75, data = lalonde, method = "nearest", distance = "glm")

Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance          0.5774        0.1822          1.7941     0.9211    0.3774   0.6444
age              25.8162       28.0303         -0.3094     0.4400    0.0813   0.1577
educ             10.3459       10.2354          0.0550     0.4959    0.0347   0.1114
raceblack         0.8432        0.2028          1.7615          .    0.6404   0.6404
racehispan        0.0595        0.1422         -0.3498          .    0.0827   0.0827
racewhite         0.0973        0.6550         -1.8819          .    0.5577   0.5577
married           0.1892        0.5128         -0.8263          .    0.3236   0.3236
nodegree          0.7081        0.5967          0.2450          .    0.1114   0.1114
re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248   0.4470
re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342   0.2876

According to vignette("assessing-balance"), the eCDF Mean is the average distance between the eCDFs of the covariate across the groups. So I've been trying to calculate the eCDF Mean manually, for example for the age covariate.

First, I split the data into two sets: "people1" for the treated units and "people2" for the untreated units. Then I create the eCDF of age for the treated (A) and the untreated (B):

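#AGE
# Assumed split (not shown in the original post): treated vs. untreated units
people1 <- lalonde[lalonde$treat == 1, ]
people2 <- lalonde[lalonde$treat == 0, ]
people1$age
people=na.omit(people1$age)
age1=ecdf(as.numeric(people))
people2$age
people2=na.omit(people2$age)
age2=ecdf(as.numeric(people2))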

as.list(environment(age1))
A=as.data.frame(cbind(as.list(environment(age1))$x, as.list(environment(age1))$y));A
as.list(environment(age2))
B=as.data.frame(cbind(as.list(environment(age2))$x, as.list(environment(age2))$y));B

The C data frame below merges the eCDFs of the treated (A) and untreated (B) groups by age value.

C=merge(A,B,by="V1",all=TRUE);C
C=na.omit(C) # drop rows with NA values
D=abs(C$V2.x-C$V2.y);summary(D)

D is the absolute difference between the eCDFs of the treated (treat=1) and untreated (treat=0) groups, but the resulting mean is:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01850 0.06193 0.08809 0.09113 0.11888 0.15773

As you can see, the maximum eCDF difference matches the MatchIt output, but the mean eCDF difference does not. Can anybody see the problem, or explain how the eCDF Mean is calculated? Thank you!

CodePudding user response:

This is some of the most convoluted code I've ever seen. I'll simplify things and show you how the statistic is calculated. That said, this statistic has not been well studied and is part of the output primarily for historical reasons. Use eCDF Max (the Kolmogorov-Smirnov statistic) instead.

Step 1: get the eCDFs (which are functions, not vectors) from the treated and control units

ecdf1 <- ecdf(lalonde$age[lalonde$treat == 1])
ecdf0 <- ecdf(lalonde$age[lalonde$treat == 0])

Each of these functions takes a value of the variable (age) and returns the cumulative probability at that value, i.e., the proportion of units in that group whose age is less than or equal to it.
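As a quick illustration of that behavior, here is a minimal sketch on a toy vector. The values and the names `x_toy` and `F_toy` are made up for illustration and have nothing to do with lalonde.

```r
# Toy data: four observations, with the value 2 duplicated
x_toy <- c(1, 2, 2, 3)

# ecdf() returns a function, not a vector
F_toy <- ecdf(x_toy)

F_toy(2)        # 0.75: three of the four observations are <= 2
F_toy(0)        # 0: below the smallest observation
F_toy(10)       # 1: above the largest observation
F_toy(c(1, 3))  # 0.25 1.00: the function is vectorized
```

The duplicated value shows up as a larger step at 2 (from 0.25 to 0.75), not as two separate points.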

Step 2: evaluate the eCDFs at each unique value of age

The reason we have to use unique values is that the eCDF already accounts for the duplicate values by creating a step in the function.

cum.dens1 <- ecdf1(unique(lalonde$age))
cum.dens0 <- ecdf0(unique(lalonde$age))

Step 3: compute the mean and maximum values of the absolute difference

ecdf.diffs <- abs(cum.dens1 - cum.dens0)
mean(ecdf.diffs)
# [1] 0.08133907
max(ecdf.diffs)
# [1] 0.157727

We can see we get the right answers.

The actual code MatchIt uses is a bit less transparent but it makes it run much faster.
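For reuse across covariates, the three steps can be wrapped in a small helper. This is only a sketch of the steps above, not MatchIt's internal code; the name `ecdf_stats` and the toy data are hypothetical, and it covers only the unweighted "All Data" case (it does not handle matching weights).

```r
# Hypothetical helper (not part of MatchIt): eCDF mean and max for one covariate
ecdf_stats <- function(x, treat) {
  # Step 1: one eCDF per group
  ecdf1 <- ecdf(x[treat == 1])
  ecdf0 <- ecdf(x[treat == 0])

  # Step 2: evaluate both eCDFs at every unique value in the full sample
  vals <- unique(x)
  diffs <- abs(ecdf1(vals) - ecdf0(vals))

  # Step 3: summarize the absolute differences
  c(mean = mean(diffs), max = max(diffs))
}

# Toy data (made up): three "treated" and three "control" units
x_toy <- c(1, 2, 2, 3, 4, 5)
treat_toy <- c(1, 1, 1, 0, 0, 0)

res <- ecdf_stats(x_toy, treat_toy)
print(res)  # mean = 7/15 (about 0.467), max = 1
```

With MatchIt loaded, `ecdf_stats(lalonde$age, lalonde$treat)` performs the same computation as steps 1 to 3.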

CodePudding user response:

This is not an answer to the question but it's too big to be a comment.

The problem in the question seems to come from the way the MatchIt package computes the averages: they appear to be weighted averages.

The code below has the same output as the question's code but I post it here because I think it's more idiomatic. It's definitely simpler.

library("MatchIt")
data("lalonde")

m.out1 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  method = "nearest", distance = "glm")
summary(m.out1)

sp_lalonde <- split(lalonde, lalonde$treat)
tmp <- lapply(sp_lalonde, \(x){
  e <- ecdf(x$age)
  out <- as.list(environment(e))[c("x", "y")]
  as.data.frame(out)
})
C <- Reduce(function(x, y) merge(x, y, by = "x", all = TRUE), tmp) |> na.omit()
D <- abs(C[[2]] - C[[3]])

summary(D)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#0.01850 0.06193 0.08809 0.09113 0.11888 0.15773 
mean(apply(C[-1], 1, dist))
#[1] 0.09112509
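One concrete difference between a merge-based calculation and evaluating both eCDFs at every unique value can be seen on toy data. The sketch below uses made-up values (`x1` and `x0` are hypothetical); it shows that `merge(..., all = TRUE)` followed by `na.omit()` keeps only the values observed in both groups, which leaves the maximum unchanged in this example but changes the mean. Whether this fully accounts for the gap on lalonde is not verified here.

```r
x1 <- c(1, 2, 2, 3)   # toy "treated" values
x0 <- c(2, 3, 4, 5)   # toy "control" values
e1 <- ecdf(x1)
e0 <- ecdf(x0)

# Merge-based approach: one row per step of each eCDF, then drop incomplete rows
A <- data.frame(x = knots(e1), y1 = e1(knots(e1)))
B <- data.frame(x = knots(e0), y0 = e0(knots(e0)))
C <- na.omit(merge(A, B, by = "x", all = TRUE))
d_common <- abs(C$y1 - C$y0)

# Alternative: evaluate both eCDFs at every unique value from either group
vals <- sort(unique(c(x1, x0)))
d_all <- abs(e1(vals) - e0(vals))

C$x             # 2 3: only values present in both groups survive na.omit()
mean(d_common)  # 0.5
mean(d_all)     # 0.3
max(d_common)   # 0.5
max(d_all)      # 0.5
```

The rows dropped by `na.omit()` are exactly the values that occur in only one group, so the two means are averages over different sets of points.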