I'm looking for R solution to the following problem:
I have a disease registry formatted as shown below:
Patient | Diagnosis | Date of diagnosis 1 | ... | Date of diagnosis 47 | ... | Dates of diagnosis n |
---|---|---|---|---|---|---|
ID0001 | C18.9 - Malignant neoplasm of colon [first disease mentioned]; Disease 2; ...; Disease n | 2020-01-21 | ... | ... | ... | ... |
.... | ... | ... | ... | ... | ... | ... |
ID18000 | [...]; C18.9 - Malignant neoplasm of colon [mentioned as 47th diagnosis out of 95]; [...] | ... | ... | 2005-03-04 | ... | ... |
ID18001 | C18.9 - Malignant neoplasm of colon [the last of n mentioned] | ... | ... | ... | ... | 2011-02-11 |
Where for each row (patient) there's a column with semicolon-separated disease names and consecutive columns format of each of diagnoses.
I want to derive from this dataset a binary variable for particular diagnosis and additional column with its date (for example "Colon cancer"). To do that, one has to now the position of the disease in Diagnosis column (as this will reflect the number of the Date column). As shown the place where the disease is mentioned can vary and therefore the column with dates varies between patients
My initial idea was to split the Diagnosis column to separate ones by semicolon but considering the size of the dataset it's not optimal.
I'm wondering whether there's any function available in stringr package that could solve this without a need for column split?
Thank you for help!
CodePudding user response:
Hopefully ive understood correctly but heres my solution using dplyr
df <- data.frame(Patient =c("ID0001","ID0002","ID0003"),Diagnosis=c("Disease1; Disease2; Disease3"),Date_of_diagnosis1=as.Date("2020-01-21"),Date_of_diagnosis2 = as.Date("2020-01-23"),Date_of_diagnosis3=as.Date("2015-12-01"))
df %>%
mutate(Diagnosis = strsplit(Diagnosis, ";")) %>%
unnest(Diagnosis)
Output
Patient Diagnosis Date_of_diagnosis1 Date_of_diagnosis2 Date_of_diagnosi~
<chr> <chr> <date> <date> <date>
1 ID0001 "Disease1" 2020-01-21 2020-01-23 2015-12-01
2 ID0001 " Disease2" 2020-01-21 2020-01-23 2015-12-01
3 ID0001 " Disease3" 2020-01-21 2020-01-23 2015-12-01
4 ID0002 "Disease1" 2020-01-21 2020-01-23 2015-12-01
5 ID0002 " Disease2" 2020-01-21 2020-01-23 2015-12-01
6 ID0002 " Disease3" 2020-01-21 2020-01-23 2015-12-01
7 ID0003 "Disease1" 2020-01-21 2020-01-23 2015-12-01
8 ID0003 " Disease2" 2020-01-21 2020-01-23 2015-12-01
9 ID0003 " Disease3" 2020-01-21 2020-01-23 2015-12-01
CodePudding user response:
I suggest you should convert your data from this embedded-and-wide format into a simpler one-diagnosis/date-per-row long format.
Your sample data is not much to work with, so here is some fake data, I hope it is somewhat representative:
dat <- data.frame(
Patient = c("ID0001","ID18000","ID18001"),
Diagnosis = c("Disease1;Disease2;Disease3", "Disease2;Disease17", "Disease1;Disease4;Disease5"),
Date_of_diagnoses1 = c("2018-01-21", "2018-01-22", "2019-01-23"),
Date_of_diagnoses2 = c("2019-02-21", "2019-02-22", "2019-02-23"),
Date_of_diagnoses3 = c("2020-03-21", NA, "2020-03-23")
)
dat
# Patient Diagnosis Date_of_diagnoses1 Date_of_diagnoses2 Date_of_diagnoses3
# 1 ID0001 Disease1;Disease2;Disease3 2018-01-21 2019-02-21 2020-03-21
# 2 ID18000 Disease2;Disease17 2018-01-22 2019-02-22 <NA>
# 3 ID18001 Disease1;Disease4;Disease5 2019-01-23 2019-02-23 2020-03-23
Using the tidyverse:
library(dplyr)
library(stringr)
library(purrr) # pmap_chr
dat %>%
tidyr::pivot_longer(-c(Patient, Diagnosis), names_to = "Sequence", values_to = "Date") %>%
filter(!is.na(Date)) %>%
mutate(
Date = as.Date(Date),
Sequence = as.integer(str_extract(Sequence, "[0-9] $")),
Diagnosis = purrr::pmap_chr(list(strsplit(Diagnosis, ";", fixed = TRUE), Sequence), `[[`)
)
# # A tibble: 8 x 4
# Patient Diagnosis Sequence Date
# <chr> <chr> <int> <date>
# 1 ID0001 Disease1 1 2018-01-21
# 2 ID0001 Disease2 2 2019-02-21
# 3 ID0001 Disease3 3 2020-03-21
# 4 ID18000 Disease2 1 2018-01-22
# 5 ID18000 Disease17 2 2019-02-22
# 6 ID18001 Disease1 1 2019-01-23
# 7 ID18001 Disease4 2 2019-02-23
# 8 ID18001 Disease5 3 2020-03-23
Assumptions about the data:
- The number of
Date_of_diagnoses#
fields is always correct, i.e., there are always as manyDate*
columns as;
-delimited diagnoses inDiagnosis
; - The number at end of each
Date
counts correctly and can be used as an index onDiagnosis
; this is not a strict requirement (it's not hard to work around), but I found it convenient to use it, and have some more assurance that we're always using the correctDate
with the correctDiagnosis
Diagnosis
is perfectly formed, no embedded semi-colons that will cloud the extraction
In general, while this does lengthen your data (perhaps significantly, depending on the number of diagnoses per patient), it also provides a much cleaner (in my opinion) view of the data: the ability to extract individual diseases much more easily.