I have a data frame with a large number of variables. One of them, the probability of death (PoD), is to be predicted from all the others. As a preliminary step, I want to estimate the PoD by computing the death rate within bins of each variable.
Let's say df <- data.frame(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag = c(1, 0, 1))
Then I can group by age (say under 50 and over 50) and compute the PoD of each group as its death rate: the number of death_flags divided by the number of people in the group, or simply the mean of death_flag. When grouping by weight (say below and above 80), I obtain a different death rate, and thus a different PoD, for each binned variable, which is what I want. My problem arises when I try to iterate over all variables.
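For a single variable, the computation described above might look like the following sketch. The threshold of 50, the bin labels, and the names age_bin and pod_by_age are illustrative assumptions, not part of the real data.

```r
library(dplyr)

# Toy data from the example above
df <- data.frame(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Bin age at an illustrative threshold of 50,
# then take the mean of death_flag within each bin
pod_by_age <- df %>%
  mutate(age_bin = if_else(age < 50, "under_50", "50_and_over")) %>%
  group_by(age_bin) %>%
  summarise(PoD_bin = mean(death_flag), n = n())

print(pod_by_age)
```

The n column is only there to show how many rows each rate is based on, which matters when bins are sparse.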
So far I've tried variants of the following code, which does not work:
for(n in names(df)) {
df%>% group_by(n)%>%
summarise(PoD_bin = mean(death_flag))
}
I haven't figured out a way to run through all variables and perform the computation.
As a side note, I have done the binning of the variables without dplyr:
for(v in names(df[-1])){
newVar <- paste(v, "bin", sep = "_")
df[newVar] <- cut(as.matrix(df[v]), breaks = 100)
}
It irritates me that I cannot refer to the variables in the first for loop for the grouping, while I can do so in the second to create new columns of the df.
Help is greatly appreciated!
CodePudding user response:
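Your loop doesn't work because a character string is passed to group_by() instead of a column reference. You could modify your loop a little and get the desired result: sym() turns the string into a symbol and !! unquotes it, so group_by() sees the column itself. I have added print() to see the output, since results are not printed automatically inside a for loop.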
for (n in names(df)) {
df |>
group_by(!!sym(n)) |>
summarise(PoD_bin = mean(death_flag)) |>
print()
}
Output:
# A tibble: 3 × 2
age PoD_bin
<dbl> <dbl>
1 25 1
2 57 0
3 60 1
# A tibble: 3 × 2
weight PoD_bin
<dbl> <dbl>
1 61 1
2 80 1
3 92 0
# A tibble: 3 × 2
cigarettes_a_day PoD_bin
<dbl> <dbl>
1 2 0
2 19 1
3 30 1
# A tibble: 2 × 2
death_flag PoD_bin
<dbl> <dbl>
1 0 0
2 1 1
Data:
df <- tibble(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag=c(1,0,1))
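As a possible variation, here is a minimal sketch that skips death_flag itself, bins each predictor with cut(), and collects the results in a named list instead of printing them. It uses the .data pronoun as an alternative to !!sym(). The names predictors, pod_list and bin are made up for illustration, and breaks = 2 is chosen only because the toy data has three rows (the question uses breaks = 100).

```r
library(dplyr)

df <- tibble(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Predictors only: grouping by death_flag itself is not informative
predictors <- setdiff(names(df), "death_flag")

# For each predictor, bin it and compute the death rate per bin.
# .data[[v]] looks up the column whose name is stored in the string v.
pod_list <- lapply(predictors, function(v) {
  df |>
    mutate(bin = cut(.data[[v]], breaks = 2)) |>  # breaks = 2 is illustrative
    group_by(bin) |>
    summarise(PoD_bin = mean(death_flag), n = n(), .groups = "drop")
})
names(pod_list) <- predictors

pod_list$age
```

Keeping the results in a list makes it easy to inspect one variable at a time (pod_list$weight) or to combine them later, for example with bind_rows(pod_list, .id = "variable").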