I am using dplyr and across to summarise some data:
library(dplyr)

n <- 500
durations <- c(3, 5, 6, 8, 10)
prop_fever_resolve <- c(0.5, 0.6, 0.7, 0.8, 0.9)
prop_urine_sterile <- c(0.8, 0.85, 0.88, 0.9, 0.93)
prop_additional_abx <- c(0.2, 0.17, 0.14, 0.1, 0.05)
duration_props <- data.frame(duration = durations,
                             prop_fever_resolve = prop_fever_resolve,
                             prop_urine_sterile = prop_urine_sterile,
                             prop_additional_abx = prop_additional_abx)
set.seed(6942069)
# Simulate n patients: draw a duration row, then sample each outcome per row
data <- duration_props %>%
  slice_sample(n = n, replace = TRUE) %>%
  rowwise() %>%
  mutate(fever_resolve = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_fever_resolve, prop_fever_resolve)),
         urine_sterile = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_urine_sterile, prop_urine_sterile)),
         additional_abx = sample(c(FALSE, TRUE), 1, prob = c(1 - prop_additional_abx, prop_additional_abx)),
         cured = fever_resolve & urine_sterile & !additional_abx)
# Summarise the cure rate per duration
data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            pos = across("cured", sum),
            prop = pos / n)
  duration     n pos$cured prop$cured
     <dbl> <int>     <int>      <dbl>
1        3    95        37      0.389
2        5   106        38      0.358
3        6    98        59      0.602
4        8   105        69      0.657
5       10    96        82      0.854
Why are my columns named pos$cured and prop$cured?
CodePudding user response:
The reason is that across returns a tibble. Hence, pos is a tibble with one column, cured, and prop, which is computed from pos, is a data frame with one column as well. You can see that using e.g. str:
library(dplyr)
foo <- data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            pos = across("cured", sum),
            prop = pos / n)
str(foo)
#> tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ duration: num [1:5] 3 5 6 8 10
#>  $ n       : int [1:5] 95 106 98 105 96
#>  $ pos     : tibble [5 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ cured: int [1:5] 37 38 59 69 82
#>  $ prop    :'data.frame': 5 obs. of 1 variable:
#>   ..$ cured: num [1:5] 0.389 0.358 0.602 0.657 0.854
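The same structure can be checked on a smaller example. This is only a minimal sketch, assuming a hypothetical toy tibble (toy is not the data from the question): assigning the result of across to a name stores the whole tibble as a single nested column.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  duration = c(3, 3, 5, 5),
  cured    = c(TRUE, FALSE, TRUE, TRUE)
)

res <- toy %>%
  group_by(duration) %>%
  summarise(pos = across(cured, sum))

is.data.frame(res$pos)  # TRUE: pos is itself a data frame
names(res)              # top-level columns: duration, pos
names(res$pos)          # nested column inside pos: cured
```

When printed, the nested column is shown as pos$cured, which is where the name in the question comes from.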
IMHO across is not really needed here (see the answer by @harre). But if you insist on using across, you could call it without assigning it to a name and use the .names argument to get your desired result:
data %>%
  group_by(across("duration")) %>%
  summarize(n = n(),
            across("cured", sum, .names = "pos"),
            prop = pos / n)
#> # A tibble: 5 × 4
#>   duration     n   pos  prop
#>      <dbl> <int> <int> <dbl>
#> 1        3    95    37 0.389
#> 2        5   106    38 0.358
#> 3        6    98    59 0.602
#> 4        8   105    69 0.657
#> 5       10    96    82 0.854
CodePudding user response:
It's an artefact of across, which you don't need for what you want to do (@stefan offers a further explanation). across is typically only needed if you want to apply an operation over several variables.
Instead you'll want:
library(dplyr)
data %>%
  group_by(duration) %>%
  summarize(n = n(),
            pos = sum(cured),
            prop = pos / n)
Output:
# A tibble: 5 × 4
  duration     n   pos  prop
     <dbl> <int> <int> <dbl>
1        3    95    37 0.389
2        5   106    38 0.358
3        6    98    59 0.602
4        8   105    69 0.657
5       10    96    82 0.854
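For contrast, here is a case where across does earn its place: the same summary applied to several variables at once. This is a rough sketch on a hypothetical toy tibble (toy and the prop_ prefix are illustrative choices, not part of the question), using the .names glue pattern to build the output column names.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  duration      = c(3, 3, 5, 5),
  fever_resolve = c(TRUE, FALSE, TRUE, TRUE),
  urine_sterile = c(TRUE, TRUE, FALSE, TRUE)
)

res <- toy %>%
  group_by(duration) %>%
  summarise(n = n(),
            # unnamed across: its columns are spliced into the result
            across(c(fever_resolve, urine_sterile), mean,
                   .names = "prop_{.col}"))

names(res)
```

Because across is not assigned to a name here, the result has ordinary top-level columns (prop_fever_resolve and prop_urine_sterile) rather than a nested data frame.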