Match multiple substrings from another list of all possible substrings in R

Time:07-01

I am trying to match all possible substrings between two column vectors in R, but have not been successful.

I have a long vector of strings which is as follows: S = c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)

and another vector containing parts: V = c('INTEL','CORE', 'VPRO', 'I5', 'I9')

My desired output is as follows: c("INTELI5VPRO", "CORE", NA, "VPROI9", NA)

Based on some of the earlier questions, I have tried the following:

library(stringr)
str_extract(S, paste(V, collapse = "|"))

OR str_extract(S, paste(V, collapse = "INTEL|CORE|VPRO|I5|I9"))

In the first case, the result is c("INTEL", "CORE", NA, "VPRO", NA), while in the second case the result is c("I5", "CORE", NA, "VPRO", NA).

I have also tried by removing the whitespace or underscore in the S column vector, but it has not worked.

Any help would be greatly appreciated. Thank you!

CodePudding user response:

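Here we create x by stripping the leading digits and the underscores from S, and then use grepl to keep the elements that contain at least one of the parts in V. Note that this returns the whole cleaned string, not only the matched parts: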

library(stringr)

# drop the leading "<digits>_" prefix, then remove the remaining underscores
x <- str_replace_all(str_remove(S, '\\d+_'), '_', '')

x[grepl(paste0(V, collapse = "|"), x)]
[1] "INTELI5VPRO" "COREdfds"    "VPROLI9" 

CodePudding user response:

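Calling str_extract_all once per part in V gives you a matrix of hits, with one row per string in S and one column per part. Concatenating the strings of each row almost gives you your desired result. Only the third item is "" instead of NA, which the code below fixes. Note that the parts are joined in the order of V rather than the order in which they appear in S, so the first item is "INTELVPROI5" and not "INTELI5VPRO".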

library(stringr)

S = c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V = c('INTEL','CORE', 'VPRO', 'I5', 'I9')


matches <- sapply(V, function (x) str_extract_all(S, x))
result <- apply(matches, 1, function(x) str_flatten(unlist(x))) # concatenate rows
result[result == ""] <- NA
result
#> [1] "INTELVPROI5" "CORE"        NA            "VPROI9"      NA

Created on 2022-06-30 by the reprex package (v2.0.1)

CodePudding user response:

You can follow your original approach, but using str_extract_all and sapply(), like this:

sapply(str_extract_all(S, paste(V, collapse = "|")),paste0, collapse="")

Output

[1] "INTELI5VPRO" "CORE"        ""            "VPROI9"      "NA"         

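Or, you can split each string on the underscores and keep only the pieces that exactly match an element of V. Note that "VPROL" is then not treated as a match for "VPRO":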

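# \(s) is the lambda shorthand from R >= 4.1; %>% needs magrittr or a package that re-exports it
lapply(S, \(s) {
    x = strsplit(s, "_")[[1]]                  # split into underscore-separated pieces
    result = paste0(x[x %in% V], collapse="")  # keep pieces that exactly equal a part
    ifelse(result=="", as.character(NA),result)
}) %>% unlist()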

Output

[1] "INTELI5VPRO" "CORE"        NA            "I9"          NA  

CodePudding user response:

You'd want to use str_extract_all and take care of the empty extractions like the one in position 3 (based on your code):

sapply(str_extract_all(S, paste(V, collapse = "|")),
       function(x) ifelse(length(x) != 0, str_flatten(x), NA)
       )

#> [1] "INTELI5VPRO" "CORE"        NA            "VPROI9"      NA           
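
If you would rather not depend on stringr, the same idea can be sketched in base R with gregexpr and regmatches. This is only a minimal sketch under the assumption that the parts in V contain no regex metacharacters; the names pattern, hits and result are made up for illustration.

```r
S <- c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V <- c('INTEL', 'CORE', 'VPRO', 'I5', 'I9')

# one alternation pattern built from all parts (assumes no regex metacharacters in V)
pattern <- paste(V, collapse = "|")

# all matches per string, in the order they occur in each element of S
hits <- regmatches(S, gregexpr(pattern, S))

# join the matches of each string; strings without any match become NA
result <- vapply(
  hits,
  function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = ""),
  character(1)
)

# keep missing inputs as NA regardless of how the matching step treated them
result[is.na(S)] <- NA_character_
result
```

As with the str_extract_all version, "VPROL" still yields a "VPRO" hit because the pattern matches substrings, not whole underscore-separated pieces.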