R:stringr - How to locate a position of a word in a string separated by semicolons?-CodePudding

I'm looking for R solution to the following problem:

I have a disease registry formatted as shown below:

Patient	Diagnosis	Date of diagnosis 1	...	Date of diagnosis 47	...	Dates of diagnosis n
ID0001	C18.9 - Malignant neoplasm of colon [first disease mentioned]; Disease 2; ...; Disease n	2020-01-21	...	...	...	...
....	...	...	...	...	...	...
ID18000	[...]; C18.9 - Malignant neoplasm of colon [mentioned as 47th diagnosis out of 95]; [...]	...	...	2005-03-04	...	...
ID18001	C18.9 - Malignant neoplasm of colon [the last of n mentioned]	...	...	...	...	2011-02-11

Where for each row (patient) there's a column with semicolon-separated disease names and consecutive columns format of each of diagnoses.

I want to derive from this dataset a binary variable for particular diagnosis and additional column with its date (for example "Colon cancer"). To do that, one has to now the position of the disease in Diagnosis column (as this will reflect the number of the Date column). As shown the place where the disease is mentioned can vary and therefore the column with dates varies between patients

My initial idea was to split the Diagnosis column to separate ones by semicolon but considering the size of the dataset it's not optimal.

I'm wondering whether there's any function available in stringr package that could solve this without a need for column split?

Thank you for help!

CodePudding user response：

Hopefully ive understood correctly but heres my solution using dplyr

df <- data.frame(Patient =c("ID0001","ID0002","ID0003"),Diagnosis=c("Disease1; Disease2; Disease3"),Date_of_diagnosis1=as.Date("2020-01-21"),Date_of_diagnosis2 = as.Date("2020-01-23"),Date_of_diagnosis3=as.Date("2015-12-01"))
df %>% 
  mutate(Diagnosis = strsplit(Diagnosis, ";")) %>% 
  unnest(Diagnosis)

Output

Patient Diagnosis   Date_of_diagnosis1 Date_of_diagnosis2 Date_of_diagnosi~
  <chr>   <chr>       <date>             <date>             <date>           
1 ID0001  "Disease1"  2020-01-21         2020-01-23         2015-12-01       
2 ID0001  " Disease2" 2020-01-21         2020-01-23         2015-12-01       
3 ID0001  " Disease3" 2020-01-21         2020-01-23         2015-12-01       
4 ID0002  "Disease1"  2020-01-21         2020-01-23         2015-12-01       
5 ID0002  " Disease2" 2020-01-21         2020-01-23         2015-12-01       
6 ID0002  " Disease3" 2020-01-21         2020-01-23         2015-12-01       
7 ID0003  "Disease1"  2020-01-21         2020-01-23         2015-12-01       
8 ID0003  " Disease2" 2020-01-21         2020-01-23         2015-12-01       
9 ID0003  " Disease3" 2020-01-21         2020-01-23         2015-12-01

CodePudding user response：

I suggest you should convert your data from this embedded-and-wide format into a simpler one-diagnosis/date-per-row long format.

Your sample data is not much to work with, so here is some fake data, I hope it is somewhat representative:

dat <- data.frame(
  Patient = c("ID0001","ID18000","ID18001"),
  Diagnosis = c("Disease1;Disease2;Disease3", "Disease2;Disease17", "Disease1;Disease4;Disease5"),
  Date_of_diagnoses1 = c("2018-01-21", "2018-01-22", "2019-01-23"),
  Date_of_diagnoses2 = c("2019-02-21", "2019-02-22", "2019-02-23"),
  Date_of_diagnoses3 = c("2020-03-21", NA, "2020-03-23")
)
dat
#   Patient                  Diagnosis Date_of_diagnoses1 Date_of_diagnoses2 Date_of_diagnoses3
# 1  ID0001 Disease1;Disease2;Disease3         2018-01-21         2019-02-21         2020-03-21
# 2 ID18000         Disease2;Disease17         2018-01-22         2019-02-22               <NA>
# 3 ID18001 Disease1;Disease4;Disease5         2019-01-23         2019-02-23         2020-03-23

Using the tidyverse:

library(dplyr)
library(stringr)
library(purrr)   # pmap_chr

dat %>%
  tidyr::pivot_longer(-c(Patient, Diagnosis), names_to = "Sequence", values_to = "Date") %>%
  filter(!is.na(Date)) %>%
  mutate(
    Date = as.Date(Date),
    Sequence = as.integer(str_extract(Sequence, "[0-9] $")),
    Diagnosis = purrr::pmap_chr(list(strsplit(Diagnosis, ";", fixed = TRUE), Sequence), `[[`)
  )
# # A tibble: 8 x 4
#   Patient Diagnosis Sequence Date      
#   <chr>   <chr>        <int> <date>    
# 1 ID0001  Disease1         1 2018-01-21
# 2 ID0001  Disease2         2 2019-02-21
# 3 ID0001  Disease3         3 2020-03-21
# 4 ID18000 Disease2         1 2018-01-22
# 5 ID18000 Disease17        2 2019-02-22
# 6 ID18001 Disease1         1 2019-01-23
# 7 ID18001 Disease4         2 2019-02-23
# 8 ID18001 Disease5         3 2020-03-23

Assumptions about the data:

The number of Date_of_diagnoses# fields is always correct, i.e., there are always as many Date* columns as ;-delimited diagnoses in Diagnosis;
The number at end of each Date counts correctly and can be used as an index on Diagnosis; this is not a strict requirement (it's not hard to work around), but I found it convenient to use it, and have some more assurance that we're always using the correct Date with the correct Diagnosis
Diagnosis is perfectly formed, no embedded semi-colons that will cloud the extraction

In general, while this does lengthen your data (perhaps significantly, depending on the number of diagnoses per patient), it also provides a much cleaner (in my opinion) view of the data: the ability to extract individual diseases much more easily.