Apologies if I am not explaining this clearly. But I have the following dataset:
mydata = data.frame (Id =c (1,1,1,1,1,1,1,1,2,2,2,2),
Date = c("2001-01-31", "2001-02-13","2001-05-31",
"2001-06-02","2018-01-31","2018-03-31","2018-07-31",
"2019-04-04","2014-01-31","2014-02-02","2014-04-31",
"2014-05-18"),Outcome = c("CR","CR","Relapse","Relapse",
"CR","CR","CR","Relapse","CR", "CR","Relapse","CR"))
Which outputs the below. As you can see each patient is in certain phases at different times and I would like to capture the earliest dates of when each new phase starts per patient. I would like then to rename these phases to CR1, Relapse1, CR2, Relapse2 and so forth.
Id Date Outcome
1 2001-01-31 CR
1 2001-02-13 CR
1 2001-05-31 Relapse
1 2001-06-02 Relapse
1 2018-01-31 CR
1 2018-03-31 CR
1 2018-07-31 CR
1 2019-04-04 Relapse
2 2014-01-31 CR
2 2014-02-02 CR
2 2014-04-31 Relapse
2 2014-05-18 CR
This is the output I am trying to achieve:
Id CR1 Relapse1 CR2 Relapse2
1 2001-01-31 2001-05-31 2018-01-31 2019-04-04
2 2014-01-31 2014-04-31 2014-05-18 NA
I am not quite sure where to begin with this question and would appreciate any help! Thank you guys!
CodePudding user response:
With tidyverse
you can try the following.
(If needed, first arrange
by Date
so it is in chronological order.)
Create a grouping value depending on change in the Outcome
column (from remission to relapse and vice versa). I used rleid
from data.table
for convenience and created a temporary column, Grp
. For example:
Id Date Outcome Grp
<dbl> <chr> <chr> <int>
1 1 2001-01-31 CR 1
2 1 2001-02-13 CR 1
3 1 2001-05-31 Relapse 2
4 1 2001-06-02 Relapse 2
5 1 2018-01-31 CR 3
6 1 2018-03-31 CR 3
7 1 2018-07-31 CR 3
8 1 2019-04-04 Relapse 4
9 2 2014-01-31 CR 1
10 2 2014-02-02 CR 1
11 2 2014-04-31 Relapse 2
12 2 2014-05-18 CR 3
You can see, within each Id
, that when the Outcome
changes, the Grp
increases. That way, subsequent dates with the same Outcome
will be included in the same Grp
.
The .add
argument allows us to add Grp
to the prior grouping, which was just Id
. So, grouping now by both Grp
and Id
, you can then slice
the first row. slice(1)
or slice(n = 1)
will keep 1 row within the group. In this case, we have grouped by both Id
and Grp
, so only 1 row will be kept for a given Grp
and Id
combination.
Finally, you can add a row number that will allow for the wide output described (a consecutive sequence of numbers for CR and Relapse: 1, 2, 3, 4...). In this case, we group_by
both Id
and Outcome
, and consecutively number the dates for this combination. pivot_wider
will put the data into wide format if desired.
library(data.table)
library(tidyverse)
mydata %>%
group_by(Id) %>%
mutate(Grp = rleid(Outcome)) %>%
group_by(Grp, .add = T) %>%
slice(1) %>%
group_by(Id, Outcome) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)
Output
Id CR_1 Relapse_1 CR_2 Relapse_2
<dbl> <chr> <chr> <chr> <chr>
1 1 2001-01-31 2001-05-31 2018-01-31 2019-04-04
2 2 2014-01-31 2014-04-31 2014-05-18 NA