I received some example code and example data that's originally for use in Stata. I want to rewrite the code for use in R, which I'm more familiar with than Stata. The ultimate goal is to understand this code and be able to adapt it for application to other data in future projects that will be done in R. For reference, the sample code is for an example of the rank and replace method described in this paper.
There is a step of that method in which observations need to be ordered by race and, within each race, an index describing their need for health care. Then each observation needs to assigned a survey-weighted rank based on how they were previously ordered. The sample Stata code for this step looks like this:
gsort race index
gen count1=_n if white==1
gen count2=_n if black==1
gen count3=_n if hispanic==1
cumul count1 if white==1 [weight=int(personwt)], gen (cumulw)
cumul count2 if black==1 [weight=int(personwt)], gen (cumulb)
cumul count3 if hispanic==1 [weight=int(personwt)], gen (cumulh)
where race
is a categorical race variable, index
is the numerical index variable describing health care need, personwt
is the variable containing the relevant survey weights. There are separate dichotomous variables in the sample data denoting whether the person represented in each observation had white
, black
, or hispanic
race/ethnicity.
Here's what I've written in R for the first four lines of that Stata code:
newdata <- data[order(data$race, data$index),]
newdata_w <- subset(newdata, white == 1)
newdata_w$count1 <- 1:nrow(newdata_w)
newdata_w$count2 <- NA
newdata_w$count3 <- NA
newdata_b <- subset(newdata, black == 1)
newdata_b$count1 <- NA
newdata_b$count2 <- 1:nrow(newdata_b)
newdata_b$count3 <- NA
newdata_h <- subset(newdata, hispanic == 1)
newdata_h$count1 <- NA
newdata_h$count2 <- NA
newdata_h$count3 <- 1:nrow(newdata_h)
newdata<-rbind(newdata_w, newdata_b, newdata_h)
(Side note: I realize that's probably not the prettiest way to get that accomplished, but it works as intended.)
The part I'm struggling with is the last three lines of the Stata code. I think those are creating the survey-weighted ranking variables. From the Stata documentation, I gather that cumul varx, gen(newvarx)
creates a new variable, defined as the empirical cumulative distribution function of varx and int()
rounds down the inputted value to the whole integer value--similar to floor()
in R. I admit that I don't fully understand the empirical cumulative distribution function, even after doing some reading, so I don't totally grasp what these lines of code are doing. Within each race/ethnicity group, they're generating a new variable that describes the the weighted empirical cumulative distribution function for the relevant count variable? Probably at least in part because I don't understand exactly what those 'cumul' lines are doing, I can't figure out how to translate them into R.
I found ewcdf()
in R's spatstat
package and have tried variations of this in R:
newdata$cumulw <- ifelse(newdata$white == 1, ewcdf(newdata$count1, weights = floor(newdata$personwt)), NA)
but I constantly get errors that say: Error in rep(yes, length.out = len) : attempt to replicate an object of type 'closure'.
When I run only the ewcdf()
part from the if-else statement, I get output but it's not a vector so I can't assign it to a new variable in the dataset. It's an ewcdf object. I suspect this is ultimately the problem with the R code I've been trying, but I don't understand how to get a vector describing the weighted empirical distribution function of the count variable.
Here is a very truncated version of the example data, which originally has many more observations and variables than this. I've also manually added new variables, expected_cumulw, expected_cumulb, expected_cumulh
, which show the values I get when for these observations when I run the Stata code on the full example data in Stata. That's what I expect to get when I run the recreated code in R.
structure(list(id = 1:18, personwt = structure(c(19755.78515625,
14693.3359375, 14680.841796875, 6505.64111328125, 8609.6337890625,
19847.62890625, 3344.6005859375, 6428.26025390625, 7187.40673828125,
7194.48046875, 5987.77783203125, 7835.4814453125, 20544.095703125,
10755.46875, 6342.7041015625, 14851.7568359375, 6332.66796875,
3053.63818359375), format.stata = "%9.0g"), race = structure(c(1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3), format.stata = "%9.0g"),
white = structure(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), format.stata = "%9.0g"), black = structure(c(0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0), format.stata = "%9.0g"),
hispanic = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1), format.stata = "%9.0g"), index = structure(c(-5.71758786296799,
-5.65920841512226, -5.63605634220452, -5.63445733604188,
-5.59848291083199, -5.58385768359893, -5.50244319382152,
-5.50244314415607, -5.50244314415607, -5.50244314415607,
-5.48369698816299, -5.45280023951536, -5.77394051126251,
-5.70938540405403, -5.5200244210207, -5.51352104654952, -5.48369698816299,
-5.48369698816299), format.stata = ".0g"), count1 = c(1L,
2L, 3L, 4L, 5L, 6L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), count2 = c(NA, NA, NA, NA, NA, NA, 1L, 2L, 3L, 4L,
5L, 6L, NA, NA, NA, NA, NA, NA), count3 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, 2L, 3L, 4L, 5L, 6L),
expect_cumulw = c(2.64e-05, 4.6e-05, 6.56e-05, 7.43e-05,
8.58e-05, 0.0001124, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), expect_cumulb = c(NA, NA, NA, NA, NA, NA, 5.98e-05,
0.0001132, 0.000141, 0.0002008, 0.0002506, 0.0003158, NA,
NA, NA, NA, NA, NA), expect_cumulh = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, 0.0001462, 0.0002228, 0.0002679,
0.0003736, 0.0004186, 0.0004404)), row.names = c(NA, -18L
), class = c("tbl_df", "tbl", "data.frame"))
How can I successfully recreate this bit of Stata code in R?
Please let me know if there's anything else I can provide that would help answer my question, and thank you for your help!
CodePudding user response:
One possible solution
Note that cumul
command in your Stata example is simply computing the cumulative sum of the relative weights (cumsum(weights/sum(weights))). proportions
in R computes the relative weight (weights/sum(weights)) when applied to a vector. So, you can call cumsum
on top of proportions
to get the desired output.
library(dplyr)
cumul = function(x, w) cumsum(proportions(as.integer(w[x==1])))
df %>%
arrange(race, index) %>%
mutate(
cumulw = ifelse(white==1, cumul(white, personwt), NA),
cumulb = ifelse(black==1, cumul(black, personwt), NA),
cumulh = ifelse(hispanic==1, cumul(hispanic, personwt), NA)
)
CodePudding user response:
You can do this as follows:
df %>%
group_by(race) %>%
mutate(cumul_wt_race = cumsum(personwt/sum(personwt))
This produces a single column cumul_wt_race
An approach to separate columns could use mutate(across())
:
f <- function(p,r) {
res = cumsum(p*r/sum(p*r))
res[r==0] <- NA
res
}
mutate(df, across(white:hispanic,~f(personwt,.x),.names = "cumul_{col}"))