How to find longest continuous character in a string based on a given vector

Time:02-02

I have the following string in R:

aas <- "QAWDIIKRIDKK"

I want to find the longest continuous fragment(s) of that string that consist only of characters from the following vector:

  hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

The expected answer is:

AW, II

Another example:

QFILVMD -> FILVM

This needs to be very fast, because I have to test many strings. How can I do that in R?

CodePudding user response:

One option: split the string into single characters, replace the elements that are not in the key vector with NA, paste the characters together within each run of non-NA values, and keep the fragments with the maximum number of characters.

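f1 <- function(str1, matchvec) {
  # split into single characters and set non-matching ones to NA
  v1 <- strsplit(str1, "")[[1]]
  v1[!v1 %in% matchvec] <- NA
  # rle() gives one run id per stretch of consecutive NA / non-NA values
  v2 <- tapply(v1, with(rle(!is.na(v1)),
                        rep(seq_along(values), lengths)),
               FUN = function(x) paste(x[!is.na(x)], collapse = ""))
  # keep the longest fragment(s)
  unname(v2[nchar(v2) == max(nchar(v2))])
}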

-testing

> f1(aas, hydrophobic_res)
[1] "AW" "II"
> f1("QFILVMD", hydrophobic_res)
[1] "FILVM"
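
To see what the grouping step in f1 is doing, it can help to look at the run ids that rle() produces for the example string. This is only a minimal base-R sketch; the names hit, runs, run_id and frags are illustrative and not part of the answer above.

```r
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
v1 <- strsplit("QAWDIIKRIDKK", "")[[1]]

# TRUE where the character belongs to the key vector
hit <- v1 %in% hydrophobic_res

# rle() collapses consecutive equal values into runs
runs <- rle(hit)

# one integer id per run, repeated to the length of that run
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
# 1 2 2 3 4 4 5 5 6 7 7 7

# paste the characters of each matching run together
frags <- as.vector(tapply(v1[hit], run_id[hit], paste, collapse = ""))
frags
# "AW" "II" "I"
```

Picking the elements of frags with the maximum nchar() then gives "AW" and "II".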

A regex-based option: build a pattern that matches every character not in matchvec, replace those characters with a delimiter using gsub, split on the delimiter, and keep the elements with the maximum number of characters.

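f2 <- function(str1, matchvec) {
  # negated character class: anything that is not in matchvec
  pat <- sprintf("[^%s]", paste(matchvec, collapse = ""))
  # turn non-matching characters into "," and split on it
  v1 <- strsplit(gsub(pat, ",", str1), ",")[[1]]
  v1[nchar(v1) == max(nchar(v1))]
}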

-testing

> f2(aas, hydrophobic_res)
[1] "AW" "II"
> f2("QFILVMD", hydrophobic_res)
[1] "FILVM"
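
Since many strings have to be tested, the same regex idea can also be applied to a whole character vector at once by extracting the matching runs directly instead of splitting. The following is a rough base-R sketch, not a benchmarked solution; longest_runs is a hypothetical helper name, and it assumes matchvec holds only plain letters (no regex-special characters such as ], ^ or -).

```r
# hypothetical helper: longest run(s) of key characters for each input string
longest_runs <- function(strs, matchvec) {
  # character class matching one or more consecutive key characters
  pat <- sprintf("[%s]+", paste(matchvec, collapse = ""))
  # regmatches() returns one character vector of runs per input string
  runs <- regmatches(strs, gregexpr(pat, strs))
  lapply(runs, function(x) {
    # no key character at all in this string
    if (length(x) == 0L) return(character(0))
    n <- nchar(x)
    x[n == max(n)]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
res <- longest_runs(c("QAWDIIKRIDKK", "QFILVMD"), hydrophobic_res)
```

Here res is a list with one element per input string, holding "AW" "II" for the first string and "FILVM" for the second.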

CodePudding user response:

Here is an alternative. For me, this kind of task is easier to solve by thinking in terms of tibbles or data frames:

library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_split()

str_split(aas, "")[[1]] %>%
  as_tibble() %>%
  # flag the characters that are in the key vector
  mutate(flag = value %in% hydrophobic_res) %>%
  # consecutive equal flags share one group id
  group_by(group = rleid(flag)) %>%
  filter(flag) %>%
  summarise(string = paste(value, collapse = ""), .groups = "drop") %>%
  # keep only the longest fragment(s)
  filter(nchar(string) == max(nchar(string))) %>%
  pull(string)
[1] "AW" "II"

CodePudding user response:

Since you mentioned that speed is important, consider using stringi, which is optimized for this kind of task. It is also easy to vectorize:

library(stringi)

find_longest <- function(strng, pat) {
  # build a character class such as "[WFILVMCAG]+" (one or more key characters)
  pats <- if (is.list(pat)) {
    sapply(pat, \(x) stri_join(c("[", x, "]+"), collapse = ""))
  } else {
    stri_join(c("[", pat, "]+"), collapse = "")
  }
  res <- stri_extract_all(strng, regex = pats)
  lapply(res, \(x) {
    nc <- nchar(x)
    x[nc == max(nc)]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
aas <- "QAWDIIKRIDKK"
aas2 <- "QFILVMD"


find_longest(c(aas, aas2), hydrophobic_res)

[[1]]
[1] "AW" "II"

[[2]]
[1] "FILVM"