Finding the longest stretch of repeated words in a long string of characters

Time:04-24

I have a long DNA sequence in a text file, made up of the characters A, T, C and G. I am looking for a method in R to find the longest stretch of a repeated character. Let's say my string looks like this: AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA

I need the longest run as output, ideally with its count: AAAAAAAAAAAAAAAA n=16

Please help me with this.

CodePudding user response:

Perhaps you can try this

> s <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"

> v <- regmatches(s, gregexpr("(.)\\1+", s))[[1]]

> v[which.max(nchar(v))]
[1] "AAAAAAAAAAAAAAAA"

CodePudding user response:

First form a vector of all same base pair substrings. Then, find the longest string in that vector.

x <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
parts <- unlist(strsplit(x, "(?<=([ACGT]))(?!\\1)", perl = TRUE))
parts[order(-nchar(parts), parts)][1]

[1] "AAAAAAAAAAAAAAAA"
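The same "split into runs, then take the longest" idea can also be sketched without a regex, using base R's rle() on the individual characters. This is a different technique from the strsplit() call above, and `longest_run` is a hypothetical helper name introduced only for illustration. It returns both the run and its length, which is what the question asks for.

```r
# Sketch: find the longest run of one repeated character with rle().
# `longest_run` is a hypothetical helper, not part of any package.
longest_run <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]  # split into single characters
  runs  <- rle(chars)                          # lengths and values of each run
  i     <- which.max(runs$lengths)             # index of the first longest run
  list(run = strrep(runs$values[i], runs$lengths[i]), n = runs$lengths[i])
}

x <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
res <- longest_run(x)
res$run  # the longest stretch
res$n    # its length
```

If several runs tie for the longest, which.max() picks the first one.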

CodePudding user response:

Another possible solution:

library(tidyverse)

s <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"

s %>% 
  str_extract_all("([A-Z])\\1*") %>% map(str_count) %>% unlist %>% max

#> [1] 16

CodePudding user response:

If you have one string:

library(tidyverse)
string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"

x <- str_extract_all(string, "(.)\\1+")[[1]]
x[which.max(nchar(x))]

[1] "AAAAAAAAAAAAAAAA"

If you have many strings:

str_extract_all(c(string, string), "(.)\\1+")%>%
  map_chr(~.x[which.max(nchar(.x))])

[1] "AAAAAAAAAAAAAAAA" "AAAAAAAAAAAAAAAA"

To find the counts, just use nchar or even str_count on the result.
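As a rough base R sketch of that last step, the longest run per string and its count can be collected side by side. The input vector `strings` is made up for illustration, and the pattern "(.)\\1*" is used here (rather than requiring two or more repeats) so that every non-empty string yields at least one match.

```r
# Hypothetical inputs, chosen only to illustrate the idea.
strings <- c("AAGTTTC", "CCCCGA")

# All runs of a repeated character in each string (runs of length 1 included).
runs <- regmatches(strings, gregexpr("(.)\\1*", strings, perl = TRUE))

# Longest run per string; which.max() keeps the first one on ties.
longest <- vapply(runs, function(m) m[which.max(nchar(m))], character(1))

# Counts are just the number of characters in each longest run.
counts <- nchar(longest)

data.frame(run = longest, n = counts)
```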
