Within group_by, mutate a new column that grabs values of a column based on their order of appearanc-CodePudding

I'm wrangling a dataset that has a cross-over trial design. Here is a toy example with similar structure:

df <- structure(list(subject = c("a", "a", "a", "a", "a", "a", "b", 
"b", "b", "b", "c", "c", "c", "c", "c", "c"), treatment = c("none", 
"placebo", "placebo", "drug", "drug", "drug", "none", "drug", 
"placebo", "placebo", "none", "placebo", "drug", "drug", "drug", 
"drug"), day = c(0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0, 1, 2, 3, 4, 
5)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))

# A tibble: 16 × 3
   subject treatment   day
   <chr>   <chr>     <dbl>
 1 a       none          0
 2 a       placebo       1
 3 a       placebo       2
 4 a       drug          3
 5 a       drug          4
 6 a       drug          5
 7 b       none          0
 8 b       drug          1
 9 b       placebo       2
10 b       placebo       3
11 c       none          0
12 c       placebo       1
13 c       drug          2
14 c       drug          3
15 c       drug          4
16 c       drug          5

So, every subject starts with a "none" value in treatment, and then has a few days of either placebo or a drug treatment, followed by a few days of the other. What I want is a new stage column that tells me the chronological stage of the experiment based on the order of treatment value. In other words, the starting none value in a subject will always be the "first" stage of the experiment, the next chronologically appearing value in treatment within that subject will be "second" stage, and the last appearing value will be the "third" stage.

So the output I want would look like this:

# A tibble: 16 × 4
   subject treatment   day stage 
   <chr>   <chr>     <dbl> <chr> 
 1 a       none          0 first 
 2 a       placebo       1 second
 3 a       placebo       2 second
 4 a       drug          3 third 
 5 a       drug          4 third 
 6 a       drug          5 third 
 7 b       none          0 first 
 8 b       drug          1 second
 9 b       placebo       2 third 
10 b       placebo       3 third 
11 c       none          0 first 
12 c       placebo       1 second
13 c       drug          2 third 
14 c       drug          3 third 
15 c       drug          4 third 
16 c       drug          5 third

What made sense to me was using a combination of group_by and mutate with a factor of treatment, which doesn't work

#my failed attempt
df %>% 
  arrange(subject, day) %>% #needed for my actual dataset
  group_by(subject) %>% 
  mutate(stage=factor(treatment, levels=c("first", "second", "third"))) %>% 
  ungroup(

which gives:

# A tibble: 16 × 4
   subject treatment   day stage 
   <chr>   <chr>     <dbl> <fct> 
 1 a       none          0 second
 2 a       placebo       1 third 
 3 a       placebo       2 third 
 4 a       drug          3 first 
 5 a       drug          4 first 
 6 a       drug          5 first 
 7 b       none          0 second
 8 b       drug          1 first 
 9 b       placebo       2 third 
10 b       placebo       3 third 
11 c       none          0 second
12 c       placebo       1 third 
13 c       drug          2 first 
14 c       drug          3 first 
15 c       drug          4 first 
16 c       drug          5 first

The problem is that the labels are being shown based on the alphabetic order of the "treatment" values, but I want them to appear in the order of appearance of the treatment values within each subject. I also tried with levels instead of labels and I just get all NAs.

Any help would be much appreciated. A dplyr solution is preferred but would happily work with any other.

CodePudding user response：

You can group_by subject and then use match or rleid. Use english::ordinal to get the expected output.

df %>% 
  group_by(subject) %>% 
  mutate(match = match(treatment, unique(treatment)),
         rleid = data.table::rleid(treatment),
         stage = english::ordinal(match))

# A tibble: 16 × 6
# Groups:   subject [3]
   subject treatment   day match rleid stage       
   <chr>   <chr>     <dbl> <int> <int> <ordinal>
 1 a       none          0     1     1 first    
 2 a       placebo       1     2     2 second   
 3 a       placebo       2     2     2 second   
 4 a       drug          3     3     3 third    
 5 a       drug          4     3     3 third    
 6 a       drug          5     3     3 third    
 7 b       none          0     1     1 first    
 8 b       drug          1     2     2 second   
 9 b       placebo       2     3     3 third    
10 b       placebo       3     3     3 third    
11 c       none          0     1     1 first    
12 c       placebo       1     2     2 second   
13 c       drug          2     3     3 third    
14 c       drug          3     3     3 third    
15 c       drug          4     3     3 third    
16 c       drug          5     3     3 third

CodePudding user response：

If there is any case where the placebo is given again after the drug, creating a "fourth" stage, then depending on unique(treatment) might lead to errors.

Alternatively, you could calculate the cumulative sum of the treatment change:

library(tidyverse)

df %>% 
    group_by(subject) %>% 
    mutate(stage_change = treatment!=lag(treatment), 
           stage = cumsum(ifelse(is.na(stage_change), 1, stage_change))) %>% 
    select(-stage_change)
#> # A tibble: 16 x 4
#> # Groups:   subject [3]
#>    subject treatment   day stage
#>    <chr>   <chr>     <dbl> <dbl>
#>  1 a       none          0     1
#>  2 a       placebo       1     2
#>  3 a       placebo       2     2
#>  4 a       drug          3     3
#>  5 a       drug          4     3
#>  6 a       drug          5     3
#>  7 b       none          0     1
#>  8 b       drug          1     2
#>  9 b       placebo       2     3
#> 10 b       placebo       3     3
#> 11 c       none          0     1
#> 12 c       placebo       1     2
#> 13 c       drug          2     3
#> 14 c       drug          3     3
#> 15 c       drug          4     3
#> 16 c       drug          5     3

^{Created on 2022-05-03 by the reprex package (v2.0.1)}

You can then use english::ordinal(stage) if needed.