I'm wrangling a dataset that has a cross-over trial design. Here is a toy example with similar structure:
df <- structure(list(subject = c("a", "a", "a", "a", "a", "a", "b",
"b", "b", "b", "c", "c", "c", "c", "c", "c"), treatment = c("none",
"placebo", "placebo", "drug", "drug", "drug", "none", "drug",
"placebo", "placebo", "none", "placebo", "drug", "drug", "drug",
"drug"), day = c(0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0, 1, 2, 3, 4,
5)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
# A tibble: 16 × 3
subject treatment day
<chr> <chr> <dbl>
1 a none 0
2 a placebo 1
3 a placebo 2
4 a drug 3
5 a drug 4
6 a drug 5
7 b none 0
8 b drug 1
9 b placebo 2
10 b placebo 3
11 c none 0
12 c placebo 1
13 c drug 2
14 c drug 3
15 c drug 4
16 c drug 5
So, every subject starts with a "none" value in treatment
, and then has a few days of either placebo
or a drug
treatment, followed by a few days of the other.
What I want is a new stage
column that tells me the chronological stage of the experiment based on the order of treatment
value. In other words, the starting none
value in a subject will always be the "first" stage of the experiment, the next chronologically appearing value in treatment
within that subject will be "second" stage, and the last appearing value will be the "third" stage.
So the output I want would look like this:
# A tibble: 16 × 4
subject treatment day stage
<chr> <chr> <dbl> <chr>
1 a none 0 first
2 a placebo 1 second
3 a placebo 2 second
4 a drug 3 third
5 a drug 4 third
6 a drug 5 third
7 b none 0 first
8 b drug 1 second
9 b placebo 2 third
10 b placebo 3 third
11 c none 0 first
12 c placebo 1 second
13 c drug 2 third
14 c drug 3 third
15 c drug 4 third
16 c drug 5 third
What made sense to me was using a combination of group_by
and mutate
with a factor
of treatment
, which doesn't work
#my failed attempt
df %>%
arrange(subject, day) %>% #needed for my actual dataset
group_by(subject) %>%
mutate(stage=factor(treatment, levels=c("first", "second", "third"))) %>%
ungroup(
which gives:
# A tibble: 16 × 4
subject treatment day stage
<chr> <chr> <dbl> <fct>
1 a none 0 second
2 a placebo 1 third
3 a placebo 2 third
4 a drug 3 first
5 a drug 4 first
6 a drug 5 first
7 b none 0 second
8 b drug 1 first
9 b placebo 2 third
10 b placebo 3 third
11 c none 0 second
12 c placebo 1 third
13 c drug 2 first
14 c drug 3 first
15 c drug 4 first
16 c drug 5 first
The problem is that the labels are being shown based on the alphabetic order of the "treatment" values, but I want them to appear in the order of appearance of the treatment
values within each subject. I also tried with levels
instead of labels
and I just get all NA
s.
Any help would be much appreciated. A dplyr
solution is preferred but would happily work with any other.
CodePudding user response:
You can group_by
subject and then use match
or rleid
. Use english::ordinal
to get the expected output.
df %>%
group_by(subject) %>%
mutate(match = match(treatment, unique(treatment)),
rleid = data.table::rleid(treatment),
stage = english::ordinal(match))
# A tibble: 16 × 6
# Groups: subject [3]
subject treatment day match rleid stage
<chr> <chr> <dbl> <int> <int> <ordinal>
1 a none 0 1 1 first
2 a placebo 1 2 2 second
3 a placebo 2 2 2 second
4 a drug 3 3 3 third
5 a drug 4 3 3 third
6 a drug 5 3 3 third
7 b none 0 1 1 first
8 b drug 1 2 2 second
9 b placebo 2 3 3 third
10 b placebo 3 3 3 third
11 c none 0 1 1 first
12 c placebo 1 2 2 second
13 c drug 2 3 3 third
14 c drug 3 3 3 third
15 c drug 4 3 3 third
16 c drug 5 3 3 third
CodePudding user response:
If there is any case where the placebo is given again after the drug, creating a "fourth" stage, then depending on unique(treatment)
might lead to errors.
Alternatively, you could calculate the cumulative sum of the treatment change:
library(tidyverse)
df %>%
group_by(subject) %>%
mutate(stage_change = treatment!=lag(treatment),
stage = cumsum(ifelse(is.na(stage_change), 1, stage_change))) %>%
select(-stage_change)
#> # A tibble: 16 x 4
#> # Groups: subject [3]
#> subject treatment day stage
#> <chr> <chr> <dbl> <dbl>
#> 1 a none 0 1
#> 2 a placebo 1 2
#> 3 a placebo 2 2
#> 4 a drug 3 3
#> 5 a drug 4 3
#> 6 a drug 5 3
#> 7 b none 0 1
#> 8 b drug 1 2
#> 9 b placebo 2 3
#> 10 b placebo 3 3
#> 11 c none 0 1
#> 12 c placebo 1 2
#> 13 c drug 2 3
#> 14 c drug 3 3
#> 15 c drug 4 3
#> 16 c drug 5 3
Created on 2022-05-03 by the reprex package (v2.0.1)
You can then use english::ordinal(stage)
if needed.