I have a column containing several codes and I need to check if each code is in the Codebook. I.e.,
AppliedCodes <- c("AAAA; BBBB; CCCC;", "CCCC; FFFF; GGGG", "AAAA; DDD; EEEE")
df <- data.frame(AppliedCodes)
Codebook <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE", "FFFF")
In the outbut, the "DDD" should then be flagged as it is the only one not in Codebook. I.e.,
library(tidyverse)
df2 <- df %>%
mutate(UnknownCode = ???)
df2:
AppliedCodes | UnknownCode
AAAA; BBBB; CCCC; | No
CCCC; FFFF; GGGG | No
AAAA; DDD; EEEE | Yes
I tried to use str_detect collating the Codebook Vector into one long string, but that results in TRUE if any one of the AppliedCodes is in the Codebook;
Is there any way to solve this?
Thanks in advance
CodePudding user response:
First split each string at the semicolons using strsplit
, then check find out whether any
of these are not %in%
your codebook inside an sapply
df %>%
mutate(Unknown_code = sapply(strsplit(AppliedCodes, "; *"),
function(x) any(!x %in% Codebook)))
#> AppliedCodes Unknown_code
#> 1 AAAA; BBBB; CCCC; FALSE
#> 2 CCCC; FFFF; GGGG TRUE
#> 3 AAAA; DDD; EEEE TRUE
Note that GGGG doesn't appear in your example Codebook, so this is also identified as an unknown code.
CodePudding user response:
Here's an alternative solution that has the added value of telling you exactly which of the codes is not in the Codebook:
df %>%
# prepare data for transformation:
mutate(
# create row identifier:
ID = row_number(),
# tidy-up by removing trailing ";":
AppliedCodes = sub("\\W$", "", AppliedCodes)) %>%
# separate `Appliedcodes` values into rows:
separate_rows(AppliedCodes, sep = "; ?(?!$)") %>%
# match unknown values:
mutate(Unknown_code = !str_detect(AppliedCodes, paste0(Codebook, collapse = "|"))) %>%
# for each `ID`:
group_by(ID) %>%
# paste values together:
summarise(
AppliedCodes = str_c(AppliedCodes, collapse = ','),
Unknown_code = str_c(Unknown_code, collapse = ','))
# A tibble: 3 × 3
ID AppliedCodes Unknown_code
<int> <chr> <chr>
1 1 AAAA,BBBB,CCCC FALSE,FALSE,FALSE
2 2 CCCC,FFFF,GGGG FALSE,FALSE,TRUE
3 3 AAAA,DDD,EEEE FALSE,TRUE,FALSE
You can see from this that not only DDD
but also GGGG
is not in the Codebook