Group by ID and Outcome and take the earliest earliest Dates of specific outcomes and assign numbers-CodePudding

Apologies if I am not explaining this clearly. But I have the following dataset:

mydata = data.frame (Id =c (1,1,1,1,1,1,1,1,2,2,2,2),
Date = c("2001-01-31", "2001-02-13","2001-05-31",
"2001-06-02","2018-01-31","2018-03-31","2018-07-31",
"2019-04-04","2014-01-31","2014-02-02","2014-04-31",
"2014-05-18"),Outcome = c("CR","CR","Relapse","Relapse",
"CR","CR","CR","Relapse","CR", "CR","Relapse","CR"))

Which outputs the below. As you can see each patient is in certain phases at different times and I would like to capture the earliest dates of when each new phase starts per patient. I would like then to rename these phases to CR1, Relapse1, CR2, Relapse2 and so forth.

Id     Date        Outcome
1   2001-01-31  CR      
1   2001-02-13  CR      
1   2001-05-31  Relapse     
1   2001-06-02  Relapse     
1   2018-01-31  CR      
1   2018-03-31  CR      
1   2018-07-31  CR      
1   2019-04-04  Relapse     
2   2014-01-31  CR      
2   2014-02-02  CR
2   2014-04-31  Relapse     
2   2014-05-18  CR

This is the output I am trying to achieve:

Id     CR1       Relapse1      CR2       Relapse2
1   2001-01-31  2001-05-31  2018-01-31  2019-04-04
2   2014-01-31  2014-04-31  2014-05-18  NA

I am not quite sure where to begin with this question and would appreciate any help! Thank you guys!

CodePudding user response：

With tidyverse you can try the following.

(If needed, first arrange by Date so it is in chronological order.)

Create a grouping value depending on change in the Outcome column (from remission to relapse and vice versa). I used rleid from data.table for convenience and created a temporary column, Grp. For example:

      Id Date       Outcome   Grp
   <dbl> <chr>      <chr>   <int>
 1     1 2001-01-31 CR          1
 2     1 2001-02-13 CR          1
 3     1 2001-05-31 Relapse     2
 4     1 2001-06-02 Relapse     2
 5     1 2018-01-31 CR          3
 6     1 2018-03-31 CR          3
 7     1 2018-07-31 CR          3
 8     1 2019-04-04 Relapse     4
 9     2 2014-01-31 CR          1
10     2 2014-02-02 CR          1
11     2 2014-04-31 Relapse     2
12     2 2014-05-18 CR          3

You can see, within each Id, that when the Outcome changes, the Grp increases. That way, subsequent dates with the same Outcome will be included in the same Grp.

The .add argument allows us to add Grp to the prior grouping, which was just Id. So, grouping now by both Grp and Id, you can then slice the first row. slice(1) or slice(n = 1) will keep 1 row within the group. In this case, we have grouped by both Id and Grp, so only 1 row will be kept for a given Grp and Id combination.

Finally, you can add a row number that will allow for the wide output described (a consecutive sequence of numbers for CR and Relapse: 1, 2, 3, 4...). In this case, we group_by both Id and Outcome, and consecutively number the dates for this combination. pivot_wider will put the data into wide format if desired.

library(data.table)
library(tidyverse)

mydata %>%
  group_by(Id) %>%
  mutate(Grp = rleid(Outcome)) %>%
  group_by(Grp, .add = T) %>%
  slice(1) %>%
  group_by(Id, Outcome) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)

Output

     Id CR_1       Relapse_1  CR_2       Relapse_2 
  <dbl> <chr>      <chr>      <chr>      <chr>     
1     1 2001-01-31 2001-05-31 2018-01-31 2019-04-04
2     2 2014-01-31 2014-04-31 2014-05-18 NA