What is the simplest way to compute the average of one variable grouped by a second variable, iterat

Time:08-16

I have a data frame with a large number of variables. One of them is a death flag, and the probability of death (PoD) is to be predicted from all the others. As a preliminary step, I want to estimate the PoD by computing the death rate within bins of each variable.

Let's say df <- data.frame(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag = c(1, 0, 1))

Then I can group by age (say, under 50 and over 50) and compute the PoD of each group as the number of deaths (the sum of death_flag) divided by the number of people in the group, or simply the average death_flag. When grouping by weight (say, below and above 80) I obtain a different death rate, and thus a different PoD, for each binned variable, which is what I want. My problem arises when I try to iterate through all variables.
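For a single variable, the computation described above can be sketched in base R as follows. This is only an illustration on the toy data: the cut point of 50, the bin labels, and the names age_bin and pod_by_age are assumptions chosen for the example.

```r
# Toy data from the question
df <- data.frame(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Two age bins, [-Inf, 50) and [50, Inf); the labels are illustrative
age_bin <- cut(df$age, breaks = c(-Inf, 50, Inf),
               labels = c("under_50", "50_plus"), right = FALSE)

# The mean of a 0/1 flag per bin equals deaths divided by group size
pod_by_age <- tapply(df$death_flag, age_bin, mean)
pod_by_age
```

With these three rows, the under-50 bin holds one person who died, and the other bin holds two people of whom one died, so the two rates are 1 and 0.5.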

So far I've tried variants of the following piece of code, which however does not work:

for(n in names(df)) {

    df%>% group_by(n)%>%
      summarise(PoD_bin = mean(death_flag))
}

I haven't figured out a way to run through all variables and perform the computation.

As a side note, I have done the binning of the variables without dplyr:

for(v in names(df[-1])){
    newVar <- paste(v, "bin", sep = "_")  # new column name, e.g. "weight_bin"
    df[newVar] <- cut(as.matrix(df[v]), breaks = 100)
}

I am puzzled that I cannot refer to the variables by name in the first for loop for the grouping, while I can do so in the second to create new columns of the df.

Help is greatly appreciated!

CodePudding user response:

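Your loop doesn't work because a character string is passed to group_by(), which expects the column as a bare name rather than as a string. You can modify your loop slightly to get the desired result: sym() turns the string into a symbol and !! unquotes it inside group_by(). I have added print() to see the output, since results are not printed automatically inside a for loop.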

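library(dplyr)  # provides group_by(), summarise() and sym()

for (n in names(df)) {
  df |>
    group_by(!!sym(n)) |>                     # group by the column whose name is stored in n
    summarise(PoD_bin = mean(death_flag)) |>  # death rate within each group
    print()                                   # needed to show results inside a loop
}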

Output:

# A tibble: 3 × 2
    age PoD_bin
  <dbl>   <dbl>
1    25       1
2    57       0
3    60       1
# A tibble: 3 × 2
  weight PoD_bin
   <dbl>   <dbl>
1     61       1
2     80       1
3     92       0
# A tibble: 3 × 2
  cigarettes_a_day PoD_bin
             <dbl>   <dbl>
1                2       0
2               19       1
3               30       1
# A tibble: 2 × 2
  death_flag PoD_bin
       <dbl>   <dbl>
1          0       0
2          1       1

Data:

df <- tibble(age = c(25, 57, 60), weight = c(80, 92, 61), cigarettes_a_day = c(30, 2, 19), death_flag=c(1,0,1))
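If the per-variable summaries are needed later rather than only printed, one option is to collect them in a list. The following is a minimal sketch, not part of the original answer. It assumes dplyr 1.0.0 or later (for across() and all_of()), and the names predictors and pod_tables are illustrative. It also skips death_flag as a grouping variable, since grouping the outcome by itself is uninformative.

```r
library(dplyr)

df <- tibble(
  age = c(25, 57, 60),
  weight = c(80, 92, 61),
  cigarettes_a_day = c(30, 2, 19),
  death_flag = c(1, 0, 1)
)

# Group by every column except the outcome itself
predictors <- setdiff(names(df), "death_flag")

# One summary tibble per predictor; all_of(n) selects the column named by the string n
pod_tables <- lapply(predictors, function(n) {
  df |>
    group_by(across(all_of(n))) |>
    summarise(PoD_bin = mean(death_flag))
})
names(pod_tables) <- predictors

# Look up a single summary by variable name
pod_tables$weight
```

Because the results are stored by name, each table can be inspected or joined back to the data afterwards. On real data the same loop would run over the binned columns instead of the raw ones.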