Home > database >  How to check if any string in a vector is present in a data frame column containing multiple potenti
How to check if any string in a vector is present in a data frame column containing multiple potenti

Time:08-13

I have a column containing several codes and I need to check if each code is in the Codebook. I.e.,

AppliedCodes <- c("AAAA; BBBB; CCCC;", "CCCC; FFFF; GGGG", "AAAA; DDD; EEEE")
df <- data.frame(AppliedCodes)
Codebook     <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE", "FFFF")

In the outbut, the "DDD" should then be flagged as it is the only one not in Codebook. I.e.,

library(tidyverse)
df2 <- df %>%
  mutate(UnknownCode = ???)

df2:

AppliedCodes      |  UnknownCode
AAAA; BBBB; CCCC; |  No
CCCC; FFFF; GGGG  |  No
AAAA; DDD; EEEE   |  Yes

I tried to use str_detect collating the Codebook Vector into one long string, but that results in TRUE if any one of the AppliedCodes is in the Codebook;

Is there any way to solve this?

Thanks in advance

CodePudding user response:

First split each string at the semicolons using strsplit, then check find out whether any of these are not %in% your codebook inside an sapply

df %>%
  mutate(Unknown_code = sapply(strsplit(AppliedCodes, "; *"),
                               function(x) any(!x %in% Codebook)))
#>        AppliedCodes Unknown_code
#> 1 AAAA; BBBB; CCCC;        FALSE
#> 2  CCCC; FFFF; GGGG         TRUE
#> 3   AAAA; DDD; EEEE         TRUE

Note that GGGG doesn't appear in your example Codebook, so this is also identified as an unknown code.

CodePudding user response:

Here's an alternative solution that has the added value of telling you exactly which of the codes is not in the Codebook:

df %>%
  # prepare data for transformation:
  mutate(
    # create row identifier:
    ID = row_number(),
    # tidy-up by removing trailing ";":
    AppliedCodes = sub("\\W$", "", AppliedCodes)) %>%
  # separate `Appliedcodes` values into rows:
  separate_rows(AppliedCodes, sep = "; ?(?!$)") %>%
  # match unknown values:
  mutate(Unknown_code = !str_detect(AppliedCodes, paste0(Codebook, collapse = "|"))) %>%
  # for each `ID`:
  group_by(ID) %>%
  # paste values together:
  summarise(
            AppliedCodes = str_c(AppliedCodes, collapse = ','),
            Unknown_code = str_c(Unknown_code, collapse = ','))
# A tibble: 3 × 3
     ID AppliedCodes   Unknown_code     
  <int> <chr>          <chr>            
1     1 AAAA,BBBB,CCCC FALSE,FALSE,FALSE
2     2 CCCC,FFFF,GGGG FALSE,FALSE,TRUE 
3     3 AAAA,DDD,EEEE  FALSE,TRUE,FALSE

You can see from this that not only DDD but also GGGG is not in the Codebook

  • Related