I am trying to match all possible substrings between two column vectors in R, but have not been successful.
I have a long vector of strings which is as follows:
S = c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
and another vector containing parts: V = c('INTEL','CORE', 'VPRO', 'I5', 'I9')
My desired output is as follows: c("INTELI5VPRO", "CORE", NA, "VPROI9", NA)
Based on some of the earlier questions, I have tried the following:
library(stringr)
str_extract(S, paste(V, collapse = "|"))
OR str_extract(S, paste(V, collapse = "INTEL|CORE|VPRO|I5|I9"))
In the first case, the result is c("INTEL", "CORE", NA, "VPRO", NA), while in the second case, the result is c("I5", "CORE", NA, "VPRO", NA).
I have also tried removing the whitespace and underscores from the S vector first, but that did not work either.
Any help would be greatly appreciated. Thank you!
CodePudding user response:
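Here we create x by stripping the leading digits and the underscores from S, and then use grepl to keep only the elements that contain at least one of the parts in V: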
library(stringr)
x <- str_replace_all(str_remove(S, '(\\d+\\_)'), '\\_', '')
x[grepl(paste0(V, collapse = "|"), x)]
[1] "INTELI5VPRO" "COREdfds" "VPROLI9"
CodePudding user response:
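Calling str_extract_all once for each element of V gives you a matrix of hits. Concatenating the strings of each row almost gives you your desired result. The third item is "" instead of NA, which is fixed below, and the parts are joined in the order of V rather than in the order in which they appear in each string (hence "INTELVPROI5" for the first item).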
library(stringr)
S = c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V = c('INTEL','CORE', 'VPRO', 'I5', 'I9')
matches <- sapply(V, function (x) str_extract_all(S, x))
result <- apply(matches, 1, function(x) str_flatten(unlist(x))) # concatenate rows
result[result == ""] <- NA
result
#> [1] "INTELVPROI5" "CORE" NA "VPROI9" NA
Created on 2022-06-30 by the reprex package (v2.0.1)
CodePudding user response:
You can follow your original approach, but using str_extract_all and sapply(), like this:
sapply(str_extract_all(S, paste(V, collapse = "|")),paste0, collapse="")
Output
[1] "INTELI5VPRO" "CORE" "" "VPROI9" "NA"
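Note that in the output above the third element is the empty string and the last one is the string "NA" rather than a real missing value. A minimal sketch of one way to map both back to NA, assuming the same S and V as in the question (the names hits and res are only illustrative):

```r
library(stringr)

S <- c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V <- c('INTEL', 'CORE', 'VPRO', 'I5', 'I9')

# All matches per string, then glue each set of matches together
hits <- str_extract_all(S, paste(V, collapse = "|"))
res <- sapply(hits, paste0, collapse = "")

# "" means nothing matched; an NA input was pasted into the string "NA"
res[res == "" | is.na(S)] <- NA
res
# expected: c("INTELI5VPRO", "CORE", NA, "VPROI9", NA)
```

Checking is.na(S) rather than res == "NA" avoids wiping out a legitimate result that happens to spell "NA".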
Or, you can do something like this:
lapply(S, \(s) {
x = strsplit(s, "_")[[1]]
result = paste0(x[x %in% V], collapse="")
ifelse(result=="", as.character(NA),result)
}) %>% unlist()
Output (this compares whole underscore-separated tokens, so VPROL does not count as a match for VPRO):
[1] "INTELI5VPRO" "CORE" NA "I9" NA
CodePudding user response:
You'd want to use str_extract_all and take care of the empty extractions like the one in position 3 (based on your code):
sapply(str_extract_all(S, paste(V, collapse = "|")),
function(x) ifelse(length(x) != 0, str_flatten(x), NA)
)
#> [1] "INTELI5VPRO" "CORE" NA "VPROI9" NA
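If you would rather not depend on stringr, the same idea can be sketched in base R with gregexpr and regmatches. This is only an illustration under the assumptions of the question (the parts in V contain no regex metacharacters); extract_parts is a hypothetical helper name, not an existing function:

```r
S <- c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V <- c('INTEL', 'CORE', 'VPRO', 'I5', 'I9')

# Hypothetical helper: concatenate every part found in each string,
# in order of appearance, returning NA when the input is NA or nothing matches
extract_parts <- function(strings, parts) {
  pattern <- paste(parts, collapse = "|")
  vapply(strings, function(s) {
    if (is.na(s)) return(NA_character_)
    hits <- regmatches(s, gregexpr(pattern, s))[[1]]
    if (length(hits) == 0) NA_character_ else paste0(hits, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

result <- extract_parts(S, V)
result
# expected: c("INTELI5VPRO", "CORE", NA, "VPROI9", NA)
```

vapply with character(1) guarantees a character vector of the same length as S, which sapply does not promise for unusual inputs.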