Home > OS >  Search string for word, return variables with word position(s) within string in R/tidyverse
Search string for word, return variables with word position(s) within string in R/tidyverse

Time:12-03

I have a list of 39 unique alphanumeric pentacodes e.g. "KFC10", "TKC10", "DG012" in a dataframe variable (pattern$single_codes) which I would like to check if they are contained in another variable (database$multi_codes) in a different dataframe.

The (multi_codes) is a character variable that can contain from 0 upto 30 alphanumeric not necessarily unique (duplicates can occur) codes separated by whitespace, it would for instance look like "TKC10 JFB30 TKC10 DG001 DG012 DG002 TKC10 UGC12 DG012 TKC10" where it contains 10 whitespace separated codes in a string in this example.

The final result I would like is 39 extra variables added to the "database" dataframe named after the respective "single_codes" value e.g. (database$KFC10, database$TKC10, database$DG012 ...).

Each of the 39 extra variables would have an NA if the respective "single_codes" value was not found in the respective row within the database$multi_codes (NB: Searching for and returning position of words within the string, NOT characters). On the other hand, if the "single_codes" value was found, it should show the respective position within the whitespace separated "database$multi_codes" string.

For example if I was searching just those codes mentioned previously, in the single string mentioned, I would get 3 extra variables in database:

  1. database$KFC10 variable would have the value: NA.
  2. database$TKC10 variable would have the values: 1,3,7,10.
  3. database$DG012 variable would have the values: 5,9.

How would I do this?

I made an incomplete (not working/useful) attempt below, but am not sure how to complete this. Happy to receive a totally different approach from mine.

The pattern:

pattern <- structure(list(single_codes = c("JDW97", "JDW98", "JCA05", "JCA38", 
"JCA42", "JCA45", "AF021", "JCA96", "JCA98", "JCC00", "JCC10", 
"TJA10", "JCC20", "TJK01", "JCC96", "DV093", "JCD10", "JCW98", 
"JDA05", "JDA38", "QM007", "JDA52", "JDA63", "JDC00", "JDC10", 
"DT022", "JDC20", "JDC30", "JDC40", "JDC96", "JDC97", "AF063", 
"JDD01", "JDD96", "JDW96", "AF037", "UJD02", "UJD05", "PN000"
)), row.names = c(NA, -39L), class = "data.frame")

The database:

database <- structure(list(id = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6, 6255, 
6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 
6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 7290, 7290, 7290, 
7290, 7290, 7290, 11832, 11832, 13991, 13991, 13991, 13991, 13991, 
13991, 13991, 13991, 13991, 13991), multi_codes = c("", "AF063", 
"UJD05", "JCF12 JFF00 UJD02", "", "", "TPX10", "", "UJD02", "AV034 DT016 JDC00 DV065 QB008", 
"UGC12 UJC02 UEN12 UEN05 XXA00", "JCC10 JCC10 DR036 DR036 DR029 DR029", 
"8340 8440", "JCA45", "", "", "AF070 AF012", "FNG05 AF037", "AF021 AF063", 
"AF063 AF012", "AF037", "AF037", "AF021 AF070", "ABC56", "UJD05", 
"", "XV015", "", "AF021 AF064 CKD05 DR016 DR029", "XS007", "", 
"AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60", 
"DR029 JDA63 JDH35 JFB43 JWA00 UJD02", "DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005", 
"UJD02", "DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007", 
"JCA55 JDA55", "JCA55 JDH35 UJD02", "DT022 DT022 DT022 DT022 DT022 DT022 DT022 DV051 DV053 DV055 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057", 
"DT022 DT022 DU007 DV057 DV057 DV057 DV057 DV057 DV057 DV058 DV058 DV058 DV076 DV076 DV076 DV076 DV076 DV076 GB003", 
"8841", "", "", "AF063", "JDC30", "AF063", "TJA10 DV093 DV094 PA000 PM000 PN000 PN000 PT000 QB001 QB001 QB001 QB003 QG003 QM000 QM000 QM000 QM015 QM017 QM017 QM017 QT007 XS011 XS011 XS910 XS910 XS910 XS910 XS910 XS910 XS910", 
"JKA21 TJK01", "XV015", "DR036 TJA10 TJA10 XV018")), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 12L, 13L, 17L, 36L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
55L, 56L, 57L, 58L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 
74L, 75L, 76L, 77L, 78L, 82L, 83L, 84L, 85L), class = "data.frame")

My (NON WORKING) attempt so far:

for (i in 1:length(pattern$single_codes)) {
  print(paste(which(str_split(database$multi_codes, boundary("word"))[[1]] == pattern$single_codes[i]), collapse=","))
}

CodePudding user response:

Here's one way to approach it.

func <- function(a, b) paste(which(a %in% b), collapse = ",")
out <- outer(strsplit(database$multi_codes, " "),
             setNames(nm = pattern$single_codes),
             function(a, b) mapply(func, a, b))

dim(out)
# [1] 50 39

out[32:37,20:30]
#      JDA38 QM007                              JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96
# [1,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [2,] ""    ""                                 ""    "2"   ""    ""    ""    ""    ""    ""    ""   
# [3,] ""    "12,13,14,15,16"                   ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [4,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [5,] ""    "20,21,22,23,24,25,26,27,28,29,30" ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [6,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   

Each row is a row from database, and each column is a string from pattern. This can be cbind'ed relatively straight-forward with:

database2 <- cbind(database, out)
database2[32:37,]
#      id                                                                                                                                                                         multi_codes JDW97 JDW98 JCA05 JCA38 JCA42 JCA45 AF021 JCA96 JCA98 JCC00
# 58 6255                                                                         AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60                                                            
# 65 7290                                                                                                                                                 DR029 JDA63 JDH35 JFB43 JWA00 UJD02                                                            
# 66 7290 DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005                                                            
# 67 7290                                                                                                                                                                               UJD02                                                            
# 68 7290 DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007                                                            
# 69 7290                                                                                                                                                                         JCA55 JDA55                                                            
#    JCC10 TJA10 JCC20 TJK01 JCC96 DV093 JCD10 JCW98 JDA05 JDA38                            QM007 JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96 JDC97 AF063 JDD01 JDD96 JDW96 AF037 UJD02 UJD05 PN000
# 58                                                                                                                                                              1                                          
# 65                                                                                                        2                                                                                   6            
# 66                                   2                                           12,13,14,15,16                                                                                                           5
# 67                                                                                                                                                                                            1            
# 68                                 1,5                         20,21,22,23,24,25,26,27,28,29,30                                                                                                         3,4
# 69                                                                                                                                                                                                         

In dplyr, one can use bind_cols(database, out).

It might be useful to have NA instead of an empty string. For that, change to

func <- function(a, b) { o <- which(a %in% b); if (length(o)) paste(o, collapse = ",") else NA; }
out <- outer(strsplit(database$multi_codes, " "), setNames(nm = pattern$single_codes), function(a, b) mapply(func, a, b))
database2 <- cbind(database, out)

database2[32:37,]
#      id                                                                                                                                                                         multi_codes JDW97 JDW98 JCA05 JCA38 JCA42 JCA45 AF021 JCA96 JCA98 JCC00
# 58 6255                                                                         AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 65 7290                                                                                                                                                 DR029 JDA63 JDH35 JFB43 JWA00 UJD02  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 66 7290 DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 67 7290                                                                                                                                                                               UJD02  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 68 7290 DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 69 7290                                                                                                                                                                         JCA55 JDA55  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#    JCC10 TJA10 JCC20 TJK01 JCC96 DV093 JCD10 JCW98 JDA05 JDA38                            QM007 JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96 JDC97 AF063 JDD01 JDD96 JDW96 AF037 UJD02 UJD05 PN000
# 58  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 65  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     6  <NA>  <NA>
# 66  <NA>  <NA>  <NA>  <NA>  <NA>     2  <NA>  <NA>  <NA>  <NA>                   12,13,14,15,16  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     5
# 67  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1  <NA>  <NA>
# 68  <NA>  <NA>  <NA>  <NA>  <NA>   1,5  <NA>  <NA>  <NA>  <NA> 20,21,22,23,24,25,26,27,28,29,30  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   3,4
# 69  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

  • Related