Column names after using dplyr across

Time:11-25

I am using dplyr and across to summarise some data:

library(dplyr)

n <- 500
durations <- c(3, 5, 6, 8, 10)
prop_fever_resolve <- c(0.5, 0.6, 0.7, 0.8, 0.9)
prop_urine_sterile <- c(0.8, 0.85, 0.88, 0.9, 0.93)
prop_additional_abx <- c(0.2, 0.17, 0.14, 0.1, 0.05)

duration_props <- data.frame(duration = durations,
                             prop_fever_resolve = prop_fever_resolve,
                             prop_urine_sterile = prop_urine_sterile,
                             prop_additional_abx = prop_additional_abx)

set.seed(6942069)

data <- duration_props %>%
  slice_sample(n = n, replace = TRUE) %>%
  rowwise() %>%
  mutate(fever_resolve = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_fever_resolve, prop_fever_resolve)),
         urine_sterile = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_urine_sterile, prop_urine_sterile)),
         additional_abx = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_additional_abx, prop_additional_abx)),
         cured = fever_resolve & urine_sterile & !additional_abx)

data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            pos = across("cured", sum),
            prop = pos / n)

  duration     n pos$cured prop$cured
     <dbl> <int>     <int>      <dbl>
1        3    95        37      0.389
2        5   106        38      0.358
3        6    98        59      0.602
4        8   105        69      0.657
5       10    96        82      0.854

Why are my columns named pos$cured and prop$cured?

CodePudding user response:

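The reason is that across() returns a tibble. Because you assign that result to a name, pos becomes a tibble with a single column, cured, nested inside the summary, and it is printed as pos$cured. prop = pos / n then divides that tibble by n, so prop is also a data frame with a column cured. You can see this using e.g. str():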

library(dplyr)

foo <- data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            pos = across("cured", sum),
            prop = pos / n)

str(foo)
#> tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ duration: num [1:5] 3 5 6 8 10
#>  $ n       : int [1:5] 95 106 98 105 96
#>  $ pos     : tibble [5 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ cured: int [1:5] 37 38 59 69 82
#>  $ prop    :'data.frame':    5 obs. of  1 variable:
#>   ..$ cured: num [1:5] 0.389 0.358 0.602 0.657 0.854
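
As a minimal sketch of the same effect on a smaller scale, the snippet below uses a hypothetical four-row tibble called toy (not the data from the question) and inspects the nested column directly. The last step pulls the inner vector out again, which is one possible way to flatten the result after the fact.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  duration = c(3, 3, 5, 5),
  cured    = c(TRUE, FALSE, TRUE, TRUE)
)

res <- toy %>%
  group_by(duration) %>%
  summarize(n = n(),
            pos = across("cured", sum))  # named result: stays a tibble

names(res)              # top-level columns: "duration" "n" "pos"
is.data.frame(res$pos)  # TRUE, pos is a data-frame column
names(res$pos)          # "cured"

# Replace the nested tibble with its single inner column
flat <- res %>%
  mutate(pos = pos$cured)

names(flat)             # "duration" "n" "pos"
is.integer(flat$pos)    # TRUE, pos is now a plain vector
```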

IMHO across() is not really needed here (see the answer by @harre). But if you insist on using it, call across() without assigning it to a name and set the output column name with the .names argument:

data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            across("cured", sum, .names = "pos"),
            prop = pos / n)
#> # A tibble: 5 × 4
#>   duration     n   pos  prop
#>      <dbl> <int> <int> <dbl>
#> 1        3    95    37 0.389
#> 2        5   106    38 0.358
#> 3        6    98    59 0.602
#> 4        8   105    69 0.657
#> 5       10    96    82 0.854
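
To make the difference concrete, here is a small sketch on a hypothetical tibble toy (again, not the question's data). It compares an unnamed across() call, whose result is spliced into the output under the original column name, with one that uses .names. The .names argument takes a glue-style pattern in which {.col} stands for the input column name; the pattern "pos_{.col}" is just an illustrative choice.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  duration = c(3, 3, 5, 5),
  cured    = c(TRUE, FALSE, TRUE, TRUE)
)

# Unnamed across(): the column keeps its original name
res_default <- toy %>%
  group_by(duration) %>%
  summarize(across("cured", sum))

# Unnamed across() with .names: the column is renamed via the pattern
res_named <- toy %>%
  group_by(duration) %>%
  summarize(across("cured", sum, .names = "pos_{.col}"))

names(res_default)  # "duration" "cured"
names(res_named)    # "duration" "pos_cured"
```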

CodePudding user response:

It's an artefact of across(), which you don't need for what you want to do (@stefan offers a further explanation). across() is typically only needed when you want to apply the same operation to several variables.

Instead you'll want:

library(dplyr)

data %>%
  group_by(duration) %>%
  summarize(n = n(),
            pos = sum(cured),
            prop = pos / n)

Output:

# A tibble: 5 × 4
  duration     n   pos  prop
     <dbl> <int> <int> <dbl>
1        3    95    37 0.389
2        5   106    38 0.358
3        6    98    59 0.602
4        8   105    69 0.657
5       10    96    82 0.854
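
For contrast, a case where across() does earn its place might look like the sketch below: the same two summaries applied to several outcome columns at once. The tibble toy and the names pos and prop in the function list are assumptions made for illustration; {.fn} and {.col} in .names are filled in with the function name and the column name.

```r
library(dplyr)

# Hypothetical toy data with two logical outcome columns
toy <- tibble(
  duration      = c(3, 3, 5, 5),
  fever_resolve = c(TRUE, FALSE, TRUE, TRUE),
  urine_sterile = c(TRUE, TRUE, FALSE, TRUE)
)

res <- toy %>%
  group_by(duration) %>%
  summarize(n = n(),
            # count (sum) and proportion (mean) of TRUE for each column
            across(c(fever_resolve, urine_sterile),
                   list(pos = sum, prop = mean),
                   .names = "{.fn}_{.col}"))

names(res)
res$prop_fever_resolve  # proportion of TRUE per duration group
```

With only one column and one function, as in the question, the plain pos = sum(cured) form is shorter and avoids the nested column entirely.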