Home > database >  Group by ID and Outcome and take the earliest earliest Dates and assign numbers (i.e outcome1, outco
Group by ID and Outcome and take the earliest earliest Dates and assign numbers (i.e outcome1, outco

Time:12-17

Apologies I am not sure this warrants another question. But am basing this off of another questions (Group by ID and Outcome and take the earliest earliest Dates of specific outcomes and assign numbers (i.e outcome1, outcome2)).

mydata = data.frame (Id =c (1,1,1,1,1,1,1,1,2,2,2,2),
Date = c("2001-01-31", "2001-02-13","2001-05-31",
"2001-06-02","2018-01-31","2018-03-31","2018-07-31",
"2019-04-04","2014-01-31","2014-02-02","2014-04-31",
"2014-05-18"),Outcome = c("Relapse","CR","Relapse","Relapse",
"CR","CR","CR","Relapse","CR", "CR","Relapse","CR"))

Which outputs the below.

Id     Date        Outcome
1   2001-01-31  Relapse      
1   2001-02-13  CR      
1   2001-05-31  Relapse     
1   2001-06-02  Relapse     
1   2018-01-31  CR      
1   2018-03-31  CR      
1   2018-07-31  CR      
1   2019-04-04  Relapse     
2   2014-01-31  CR      
2   2014-02-02  CR
2   2014-04-31  Relapse     
2   2014-05-18  CR      

As you can see each patient is in certain phases at different times and I would like to capture the earliest dates of when each new phase starts per patient. I would like then to rename these phases to CR1, Relapse1, CR2, Relapse2 and so forth. But I would only like to start naming AFTER patient achieves their first CR1. So if a patient has a relapse before CR, those would be ignored.

Im hoping to get an output like this:

Id     CR1       Relapse1      CR2       Relapse2
1   2001-02-13  2001-05-31  2018-01-31  2019-04-04
2   2014-01-31  2014-04-31  2014-05-18  NA

As you see patient 1 had a relapse at 2001-01-31, but was ignored because it occured before the first CR1.

The following code was so helpful to the previous question, but was wondering if I can modify it so that it will start the counting after the first CR? Thank you!!

mydata %>%
  group_by(Id) %>%
  mutate(Grp = rleid(Outcome)) %>%
  group_by(Grp, .add = T) %>%
  slice(1) %>%
  group_by(Id, Outcome) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)

CodePudding user response:

Using data.table package

library(data.table)
mydata <- setDT(mydata)

# identify consecutive same Outcome for each Id
mydata[, flag:=rleid(Id, Outcome)]

# collapse identified consecutive values (keep min date)
tmp <- mydata[, by="flag,Id,Outcome", .(Date=min(Date))]

# mark relapse if first
tmp <- tmp[, by=Id,
  flag := (Outcome[1]=="Relapse") & (seq_len(.N)==1)
]

# remove marked relapse
tmp <- tmp[(!flag)]

# Outcome numbering
tmp[, by="Id,Outcome", n:=seq_len(.N)]

# cast (widen)
dcast(tmp, Id~paste0(Outcome, n), value.var = "Date")

Got

   Id        CR1        CR2   Relapse1   Relapse2
1:  1 2001-02-13 2018-01-31 2001-05-31 2019-04-04
2:  2 2014-01-31 2014-05-18 2014-04-31       <NA>

using %>% notation

mydata[, flag:=rleid(Id,Outcome)] %>%
  .[, by="flag,Id,Outcome", .( Date=min(Date) )] %>%
  .[, by="Id", flag := (Outcome[1]=="Relapse") & (seq_len(.N)==1)] %>%
  .[(!flag)] %>%
  .[, by="Id,Outcome", n := seq_len(.N)] %>%
  dcast(., Id ~ paste0(Outcome, n), value.var = "Date")

CodePudding user response:

That is my approach :

library(data.table)

my_df <- data.frame (Id =c (1,1,1,1,1,1,1,1,2,2,2,2),
                 Date = c("2001-01-31", "2001-02-13","2001-05-31",
                          "2001-06-02","2018-01-31","2018-03-31","2018-07-31",
                          "2019-04-04","2014-01-31","2014-02-02","2014-04-31",
                          "2014-05-18"),
                 Outcome = c("Relapse","Relapse","Relapse","Relapse",
                             "CR","CR","CR","Relapse","CR", "CR","Relapse","CR"),
                 stringsAsFactors = FALSE)

my_df <- my_df %>% group_by(Id) %>% arrange(Id, Date)

my_df <- my_df %>%
  group_by(Id) %>%
  mutate(Value = seq_along(Outcome)) %>%
  mutate(first_v = Outcome == "CR" & !duplicated(Outcome == "CR")) %>%
  mutate(first_a = first_v == "FALSE" & Value > Value[Outcome == "CR" & !duplicated(Outcome == "CR")] | Outcome == "CR") %>%
  filter(first_a == "TRUE")

my_df %>%
  group_by(Id) %>%
  mutate(Grp = rleid(Outcome)) %>%
  group_by(Grp, .add = T) %>%
  group_by(Id, Outcome) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)

CodePudding user response:

You could try adding:

filter(Date >= first(Date[Outcome == "CR"])) 

to filter out rows before the first "CR". This assumes your Date is sorted/arranged first.

library(tidyverse)
library(data.table)

mydata %>%
  group_by(Id) %>%
  filter(Date >= first(Date[Outcome == "CR"])) %>%
  mutate(Grp = rleid(Outcome)) %>%
  group_by(Grp, .add = T) %>%
  slice(1) %>%
  group_by(Id, Outcome) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)

Output

     Id CR_1       Relapse_1  CR_2       Relapse_2 
  <dbl> <chr>      <chr>      <chr>      <chr>     
1     1 2001-02-13 2001-05-31 2018-01-31 2019-04-04
2     2 2014-01-31 2014-04-31 2014-05-18 NA 
  • Related