I have the following string in R:
aas <- "QAWDIIKRIDKK"
I want to find the longest continuous fragment(s) of that string made up only of characters from the following vector:
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
The answer is:
AW, II
Another example:
QFILVMD -> FILVM
This needs to be very fast because I have to test many strings. How can I do that in R?
CodePudding user response:
One option: split the string into characters, replace the elements that are not in the key vector with NA, paste the characters back together within groups defined by the runs of NA and non-NA values, and keep the fragments with the maximum number of characters.
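f1 <- function(str1, matchvec)
{
  # split into single characters and blank out the non-matching ones
  v1 <- strsplit(str1, "")[[1]]
  v1[!v1 %in% matchvec] <- NA
  # label each run of NA / non-NA values, then paste the characters of each run
  v2 <- tapply(v1, with(rle(!is.na(v1)),
                        rep(seq_along(values), lengths)),
               FUN = function(x) paste(x[!is.na(x)], collapse = ""))
  # keep the longest fragment(s)
  unname(v2[nchar(v2) == max(nchar(v2))])
}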
Testing:
> f1(aas, hydrophobic_res)
[1] "AW" "II"
> f1("QFILVMD", hydrophobic_res)
[1] "FILVM"
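As a variation on the run-length idea in f1, the sketch below skips the grouped paste: it takes the run boundaries straight from rle and cuts the fragments out with substring. The function name longest_runs_rle is made up for illustration, and no speed claim is implied, so benchmark it on your own data before relying on it.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

# Hypothetical helper: longest run(s) of characters from matchvec in one string.
longest_runs_rle <- function(str1, matchvec) {
  chars <- strsplit(str1, "")[[1]]
  runs  <- rle(chars %in% matchvec)        # runs of matching / non-matching chars
  if (!any(runs$values)) return(character(0))  # no matching character at all

  ends   <- cumsum(runs$lengths)           # last position of each run
  starts <- ends - runs$lengths + 1L       # first position of each run

  # keep matching runs whose length equals the longest matching run
  keep <- runs$values & runs$lengths == max(runs$lengths[runs$values])
  substring(str1, starts[keep], ends[keep])
}

longest_runs_rle("QAWDIIKRIDKK", hydrophobic_res)  # "AW" "II"
longest_runs_rle("QFILVMD", hydrophobic_res)       # "FILVM"
```

A string with no matching character returns character(0) rather than raising a warning from max().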
A regex-based option: build a pattern that matches every character not in matchvec, replace those characters with a delimiter using gsub, split on the delimiter, and keep the fragments with the most characters.
f2 <- function(str1, matchvec)
{
  pat <- sprintf("[^%s]", paste(matchvec, collapse = ""))
  v1 <- strsplit(gsub(pat, ",", str1), ",")[[1]]
  v1[nchar(v1) == max(nchar(v1))]
}
Testing:
> f2(aas, hydrophobic_res)
[1] "AW" "II"
> f2("QFILVMD", hydrophobic_res)
[1] "FILVM"
CodePudding user response:
Here is an alternative. I find this kind of task easier to solve by thinking in terms of tibbles or data frames:
library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_split()
tibble(value = str_split(aas, "")[[1]]) %>%
  # flag matching characters and number the consecutive runs
  mutate(flag = value %in% hydrophobic_res,
         group = rleid(flag)) %>%
  filter(flag) %>%
  group_by(group) %>%
  summarise(string = paste(value, collapse = "")) %>%
  # keep only the longest fragment(s)
  filter(nchar(string) == max(nchar(string))) %>%
  pull(string)
[1] "AW" "II"
CodePudding user response:
Since you mentioned that speed is important, consider using stringi, which is optimized for this kind of task. Another advantage is that it is easy to vectorize:
library(stringi)
find_longest <- function(strng, pat) {
  # build a character class followed by "+" so each match is a maximal run
  pats <- if (is.list(pat)) {
    sapply(pat, \(x) stri_join(c("[", x, "]+"), collapse = ""))
  } else {
    stri_join(c("[", pat, "]+"), collapse = "")
  }
  res <- stri_extract_all(strng, regex = pats)
  # for each input string, keep the longest match(es)
  lapply(res, \(x) {
    nc <- nchar(x)
    x[nc == max(nc)]
  })
}
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
aas <- "QAWDIIKRIDKK"
aas2 <- "QFILVMD"
find_longest(c(aas, aas2), hydrophobic_res)
[[1]]
[1] "AW" "II"
[[2]]
[1] "FILVM"
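If adding a dependency is not an option, roughly the same extract-then-filter idea can be sketched in base R with gregexpr and regmatches. The name longest_runs_base is hypothetical, the pattern assumes matchvec holds plain letters (no regex metacharacters), and this is not claimed to match stringi's speed.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

# Hypothetical helper: vectorized over `strings`, returns one character
# vector of longest fragment(s) per input string.
longest_runs_base <- function(strings, matchvec) {
  # character class of allowed letters, "+" so each match is a maximal run
  pat  <- sprintf("[%s]+", paste(matchvec, collapse = ""))
  hits <- regmatches(strings, gregexpr(pat, strings))
  # the extra 0L keeps max() quiet when a string has no match
  lapply(hits, function(x) x[nchar(x) == max(nchar(x), 0L)])
}

longest_runs_base(c("QAWDIIKRIDKK", "QFILVMD"), hydrophobic_res)
```

For the two example strings this gives a list whose first element is "AW" "II" and whose second is "FILVM"; a string with no matching character yields character(0).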