I have the following data set:
df <- data.frame(identifier = c("a","b","b","c"),
disease = c("heart, lung","lung, heart,,","lung, heart, heart, liver", "kidney, brain "))
which gives:
identifier disease
1 a heart, lung
2 b lung, heart,,
3 b lung, heart, heart, liver
4 c kidney, brain
I want to be able to then go through the diseases, and for every condition create a new column. If the disease is present for a specific identifier, I want to then put a "yes" in that column. So the ideal output would be:
identifier heart lung liver kidney brain
1 a Yes Yes No No No
2 b Yes Yes No No No
3 b Yes Yes Yes No No
4 c No No No Yes Yes
Would greatly appreciate any help with this as it has stumped me for a couple of hours now :)
CodePudding user response:
After minor clean up of the input data, we can use separate_rows
and pivot_wider
to count the number of occurrences:
df <- data.frame(identifier = c("a","b","b","c"),
disease = c("heart, lung","lung, heart","lung, heart, heart, liver", "kidney, brain"))
df %>%
separate_rows(disease) %>%
mutate(myvalues="yes") %>%
pivot_wider(names_from="disease", values_from = myvalues, values_fn=first, values_fill = "no")
# identifier heart lung liver kidney brain
#1 a yes yes no no no
#2 b yes yes yes no no
#3 c no no no yes yes
Alternatively, if you want to count the results:
df %>%
separate_rows(disease) %>%
mutate(myvalues=1) %>%
pivot_wider(names_from="disease", values_from = myvalues, values_fn=sum, values_fill=0)
# identifier heart lung liver kidney brain
#1 a 1 1 0 0 0
#2 b 3 2 1 0 0
#3 c 0 0 0 1 1
An alternative is from the janitor
library tabyl()
function:
library(janitor)
df %>%
separate_rows(disease) %>%
tabyl(identifier, disease)
CodePudding user response:
You're going to need to do some cleaning of the df, getting rid of the commas
then one way to solve this is use %in%
with ifelse
and lapply
cleaned <- strsplit(gsub(",\\s?", " ", df$disease), "\\s\\s?")
unique_cleaned <- unique(unlist(cleaned))
cbind(df[1], do.call(rbind.data.frame,
lapply(cleaned, \(x) ifelse(unique_cleaned %in% x, "yes", "no"))) |>
setNames(unique_cleaned))
identifier heart lung liver kidney brain
1 a yes yes no no no
2 b yes yes no no no
3 b yes yes yes no no
4 c no no no yes yes