I have a two-step data cleaning problem for a dataset of patient pathways (e.g. Arrival -> Area A -> Ward). This is an example of what the data looks like:
df <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Diagnostics", "Area B", "Area A", NA),
                 Area4 = c("Ward", "Ward", "Area B", "Area C", NA),
                 Area5 = c(NA, NA, "Ward", "Arrival", NA)
)
Step 1: Removing duplicates in consecutive columns
Some patients have the same area in consecutive columns, e.g. patient 2 (Diagnostics -> Diagnostics) and patient 3 (Area B -> Area B). I need each pathway to contain no consecutive repeats.
I have solved this using apply() and rle():
df1 <- apply(df,1,rle)
However, this gives me a (large) list with the values and lengths. How can I convert that back into a data frame of the above form (i.e. keeping patient ID and values)? I have tried various versions of do.call(), rbindlist() and unlist(), but none of them seem to work for me.
Step 2: Check logic of pathways
Assume we now have a clean dataset:
dfclean <- data.frame(Patient = c(1,2,3,4,5),
                      Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                      Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                      Area3 = c("Area B", "Ward", "Area B", "Area A", NA),
                      Area4 = c("Ward", NA, "Ward", "Area C", NA),
                      Area5 = c(NA, NA, NA, "Arrival", NA)
)
Now I need to check the logic of the pathways. To do so, I have a second dataset that lists all possible pathways, and I need to check whether each pathway in dataset 1 is "possible" according to dataset 2. Suppose dataset 2 looks like this:
df2 <- data.frame(Patient = c(1,2,3,4,5),
                  Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                  Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                  Area3 = c("Area B", "Area A", "Area B", "Area A", NA),
                  Area4 = c("Ward", "Ward", "Ward", "Area C", NA),
                  Area5 = c(NA, NA, NA, NA, NA)
)
I would like to create a variable that indicates TRUE for valid pathways (e.g. Patient 1) and FALSE for invalid pathways (e.g. Patient 4). I have no idea how to do that...
CodePudding user response:
Step 1:
df[,-1] <- data.frame(t(apply(df[,-1], 1, function(z) {
  r <- rle(z)  # collapse runs of identical consecutive values
  # pad with NA so every row keeps its original length
  c(r$values, rep(NA, length(z) - length(r$values)))
})))
df
#   Patient    Area1       Area2  Area3  Area4   Area5
# 1       1 Arrival1      Area A Area B   Ward    <NA>
# 2       2 Arrival1 Diagnostics   Ward   <NA>    <NA>
# 3       3 Arrival2      Area A Area B   Ward    <NA>
# 4       4 Arrival1      Area B Area A Area C Arrival
# 5       5 Arrival2        <NA>   <NA>   <NA>    <NA>
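To see why the padding is needed, it may help to look at what rle() does with a single pathway. The snippet below is only an illustration using patient 2's row from the question; the variable names z, r and padded are chosen here for the example.

```r
# One pathway with a consecutive duplicate and a trailing NA (patient 2 in df)
z <- c("Arrival1", "Diagnostics", "Diagnostics", "Ward", NA)

# rle() collapses runs of equal consecutive values
r <- rle(z)
r$values   # the pathway without consecutive repeats
r$lengths  # how long each run was

# The collapsed pathway is shorter than z, so pad it with NA
# to get back to the original number of columns
padded <- c(r$values, rep(NA, length(z) - length(r$values)))
padded
```

Note that rle() treats every NA as its own run, so missing values are never merged with each other. That is harmless here because the NAs only appear at the end of a pathway, but an NA in the middle of a pathway would stay where it is.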
Step 2: (tbd, pending "possible pathways")
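Until the real table of possible pathways is available, here is one possible sketch for Step 2, using dfclean and df2 exactly as posted in the question. It rests on a few assumptions that may not hold for the real data: each row of df2 describes one complete allowed pathway, the Patient column of df2 carries no meaning for the check, and NAs only pad the end of a pathway. path_key() is a hypothetical helper written for this example, not a function from any package.

```r
# Cleaned pathways and the table of allowed pathways, as in the question
dfclean <- data.frame(Patient = c(1,2,3,4,5),
                      Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                      Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                      Area3 = c("Area B", "Ward", "Area B", "Area A", NA),
                      Area4 = c("Ward", NA, "Ward", "Area C", NA),
                      Area5 = c(NA, NA, NA, "Arrival", NA))
df2 <- data.frame(Patient = c(1,2,3,4,5),
                  Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                  Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                  Area3 = c("Area B", "Area A", "Area B", "Area A", NA),
                  Area4 = c("Ward", "Ward", "Ward", "Area C", NA),
                  Area5 = c(NA, NA, NA, NA, NA))

# Hypothetical helper: collapse the Area columns of each row into one
# string, dropping NAs (assumes the first column is the patient ID)
path_key <- function(d) {
  apply(d[, -1], 1, function(z) paste(z[!is.na(z)], collapse = " -> "))
}

# A pathway counts as valid if its string appears anywhere in df2,
# regardless of which row (patient) of df2 it comes from
dfclean$valid <- path_key(dfclean) %in% unique(path_key(df2))
dfclean[, c("Patient", "valid")]
# valid is TRUE, FALSE, TRUE, FALSE, TRUE for patients 1 to 5
```

With the example data, patient 4 fails because of the extra "Arrival" step, and patient 2 fails because Arrival1 -> Diagnostics -> Ward is not listed in df2. If pathways should instead be compared per patient (row 1 against row 1, and so on), the %in% would need to be replaced with an element-wise comparison of the two key vectors.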