How to split rows into multiple rows depending on time intervals while keeping some covariates but c…

I am looking for a solution in dplyr.

Let's say I have this dataframe p

  id treatment  age sex progression  pfs
1  1      SSTR 31.3   0           0 15.6
2  2      SSTR 36.9   0           1  8.9
3  3      SSTR 44.6   1           1 25.5

Each patient (id) received a treatment and was followed until either progression (p$progression == 1) or censoring while progression-free (p$progression == 0); p$pfs is the follow-up time in months.

I want to split each row into multiple rows, one for every 12-month interval of follow-up.

Some covariates do not change across the new rows, while others should.

Unchanged

p$id
p$treatment
p$sex

(and others in my dataframe)

Covariates that change

p$age should increase by 1 for every new time interval (since each patient gets 1 year older per 12 months)
p$pfs is split into two new covariates, p$start and p$stop, which mark the bounds of each 12-month interval
A new covariate p$interval numbers the intervals (0 - 12 months is interval == 1, 12 - 24 months is interval == 2, 24 - 36 months is interval == 3 and so on)

Expected output

  id treatment sex  age progression start stop interval
1  1      SSTR   0 31.3           0     0 12.0        1
2  1      SSTR   0 32.3           0    12 15.6        2
3  2      SSTR   0 36.9           1     0  8.9        1
4  3      SSTR   1 44.6           0     0 12.0        1
5  3      SSTR   1 45.6           0    12 24.0        2
6  3      SSTR   1 46.6           1    24 25.5        3

Data

p <- structure(list(id = 1:3, treatment = structure(c(1L, 1L, 1L), levels = c("SSTR", 
                                                                         "SSA", "Control"), class = "factor"), age = c(31.3, 36.9, 44.6
                                                                         ), sex = structure(c(1L, 1L, 2L), levels = c("0", "1"), class = "factor"), 
               progression = c(0L, 1L, 1L), pfs = c(15.6, 8.9, 25.5)), row.names = c(NA, 
                                                                                   3L), class = "data.frame")

CodePudding user response：

Here's a solution using a list column with unnest:

library(dplyr)
library(purrr)
library(tidyr)

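p %>% 
  # one interval counter per started 12-month period: 1, 2, ..., n
  mutate(interval = map(pfs %/% 12L + 1L, seq_len)) %>% 
  # expand to one row per interval
  unnest(interval) %>% 
  mutate(start = 12L * (interval - 1L),
         stop = pmin(pfs, 12L * interval),  # last interval ends at pfs
         age = age + (interval - 1L)) %>%   # one year older per interval
  group_by(id) %>% 
  # only the last interval of a patient keeps the original progression status
  mutate(progression = if_else(interval != max(interval), 0L, progression)) %>% 
  select(id, treatment, sex, age, progression, start, stop, interval)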

# # A tibble: 6 × 8
# # Groups:   id [3]
#      id treatment sex     age progression start  stop interval
#   <int> <fct>     <fct> <dbl>       <int> <int> <dbl>    <int>
# 1     1 SSTR      0      31.3           0     0  12          1
# 2     1 SSTR      0      32.3           0    12  15.6        2
# 3     2 SSTR      0      36.9           1     0   8.9        1
# 4     3 SSTR      1      44.6           0     0  12          1
# 5     3 SSTR      1      45.6           0    12  24          2
# 6     3 SSTR      1      46.6           1    24  25.5        3
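
One edge case may be worth checking before relying on this: when pfs is an exact multiple of 12, pfs %/% 12 + 1 yields one extra interval whose start equals its stop. The sketch below uses base R and a made-up pfs vector (not the data from the question) to show the effect, along with one possible alternative count based on ceiling(). Whether a zero-length last interval matters depends on what you do with the result, so treat the alternative as an assumption about the desired behaviour.

```r
# Hypothetical follow-up times; 24 is an exact multiple of 12
pfs <- c(15.6, 24, 8.9)

# Count used in the answer: 24 months gives 3 intervals,
# and the third one would run from 24 to 24 (zero length)
n_floor <- pfs %/% 12 + 1

# Alternative (assumption): round up instead, with at least one interval
# so that pfs == 0 still produces a row
n_alt <- pmax(1, ceiling(pfs / 12))

print(n_floor)  # 2, 3, 1
print(n_alt)    # 2, 2, 1
```

If that behaviour suits your data, the alternative count could replace the first step as map(pmax(1, ceiling(pfs / 12)), seq_len), with the rest of the pipeline unchanged.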

The idea is to first create a list column holding the interval counters for each patient and then unnest it, which expands each patient into one row per interval. Here is the solution in slow motion:

p %>% 
  mutate(interval = map(pfs %/% 12L + 1L, seq_len)) 

#   id treatment  age sex progression  pfs interval
# 1  1      SSTR 31.3   0           0 15.6     1, 2
# 2  2      SSTR 36.9   0           1  8.9        1
# 3  3      SSTR 44.6   1           1 25.5  1, 2, 3
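
To see where those counters come from, the arithmetic can be run on its own. This is a small base R sketch using the three pfs values from the question; lapply stands in for purrr::map here only to keep the snippet free of dependencies.

```r
# Follow-up times of the three patients in the question
pfs <- c(15.6, 8.9, 25.5)

# Integer division gives the number of completed 12-month periods;
# adding 1 accounts for the (possibly partial) period still running
n_intervals <- pfs %/% 12 + 1
print(n_intervals)  # 2, 1, 3

# seq_len turns each count into the counters 1..n
counters <- lapply(n_intervals, seq_len)
print(counters)
```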

p %>% 
  mutate(interval = map(pfs %/% 12L + 1L, seq_len)) %>% 
  unnest(interval)

# # A tibble: 6 × 7
#      id treatment   age sex   progression   pfs interval
#   <int> <fct>     <dbl> <fct>       <int> <dbl>    <int>
# 1     1 SSTR       31.3 0               0  15.6        1
# 2     1 SSTR       31.3 0               0  15.6        2
# 3     2 SSTR       36.9 0               1   8.9        1
# 4     3 SSTR       44.6 1               1  25.5        1
# 5     3 SSTR       44.6 1               1  25.5        2
# 6     3 SSTR       44.6 1               1  25.5        3

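The rest is rather straightforward: start, stop and age are computed from interval, and progression is set to 0 on every row except the last interval of each patient.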
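
For a single patient, the remaining steps can be traced by hand. The sketch below does this in base R for patient 3 from the question (pfs = 25.5, age = 44.6, progression = 1); ifelse is used in place of dplyr::if_else only to avoid loading packages, and the variable names are illustrative.

```r
# Values for patient 3 in the question
pfs         <- 25.5
age0        <- 44.6
event       <- 1L
interval    <- seq_len(pfs %/% 12 + 1)            # 1, 2, 3

start       <- 12 * (interval - 1)                # 0, 12, 24
stop        <- pmin(pfs, 12 * interval)           # 12, 24, 25.5
age         <- age0 + (interval - 1)              # 44.6, 45.6, 46.6
# Only the last interval keeps the observed event status
progression <- ifelse(interval != max(interval), 0L, event)  # 0, 0, 1

print(data.frame(interval, start, stop, age, progression))
```

These values match rows 4 to 6 of the expected output in the question.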