I have a long DNA sequence in a text file, made up of the characters A, T, C and G. I am looking for a method in R to find the longest stretch of a single repeated character. Let's say my string looks like this: AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA
I need the output, ideally with a count: AAAAAAAAAAAAAAAA n=16
Please help me with this.
CodePudding user response:
Perhaps you can try this
> s <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
> v <- regmatches(s, gregexpr("(.)\\1+", s))[[1]]
> v[which.max(nchar(v))]
[1] "AAAAAAAAAAAAAAAA"
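Since the question mentions a text file, here is a minimal sketch of the same idea starting from a file. The temp file and its contents are made up for illustration; in practice you would point readLines at your own path. It assumes the file holds only sequence lines (no FASTA header) and joins them before matching, because a run may span a line break.

```r
# Hypothetical input: a temp file stands in for the real sequence file.
path <- tempfile(fileext = ".txt")
writeLines(c("AAGTGCTT", "TTTCAGA"), path)

# Join all lines into one string so runs crossing a line break stay intact.
s <- paste(readLines(path), collapse = "")

# All runs of two or more identical characters, then the longest one.
v <- regmatches(s, gregexpr("(.)\\1+", s))[[1]]
longest <- v[which.max(nchar(v))]

longest         # the longest run
nchar(longest)  # its length
```

If the file is very large, reading it in one go like this may use a lot of memory, so treat this as a starting point only.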
CodePudding user response:
First, split the string into a vector of runs of identical bases. Then find the longest string in that vector.
x <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
parts <- unlist(strsplit(x, "(?<=([ACGT]))(?!\\1)", perl = TRUE))
parts[order(-nchar(parts), parts)][1]
[1] "AAAAAAAAAAAAAAAA"
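The same two steps (form the runs, pick the longest) can also be sketched without a regex, using base R's rle. This is a different technique from the strsplit call above, and longest_run is just a hypothetical helper name. It returns the count along with the run, which is what the question asks for.

```r
# Hypothetical helper: longest run of one repeated character, with its length.
longest_run <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]  # split into single characters
  runs <- rle(chars)                           # run-length encoding
  i <- which.max(runs$lengths)                 # first longest run on ties
  list(run = strrep(runs$values[i], runs$lengths[i]),
       n = runs$lengths[i])
}

x <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
res <- longest_run(x)
cat(res$run, " n=", res$n, "\n", sep = "")
```

On ties, which.max keeps the first of the equally long runs.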
CodePudding user response:
Another possible solution:
library(tidyverse)
s <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
s %>%
str_extract_all("([A-Z])\\1*") %>% map(str_count) %>% unlist %>% max
#> [1] 16
CodePudding user response:
If you have one string:
library(tidyverse)
string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
x <- str_extract_all(string, "(.)\\1+")[[1]]
x[which.max(nchar(x))]
[1] "AAAAAAAAAAAAAAAA"
If you have many strings:
str_extract_all(c(string, string), "(.)\\1+") %>%
map_chr(~.x[which.max(nchar(.x))])
[1] "AAAAAAAAAAAAAAAA" "AAAAAAAAAAAAAAAA"
To find the counts, just use nchar or even str_count on the result.
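As a rough sketch of that last step, here is one way to get each longest run together with its count for several strings. It uses base R's regmatches in place of str_extract_all so it runs without the tidyverse, and the input vector is made up for illustration. The pattern ends in * rather than + so that a string with no repeated character still yields a run of length 1.

```r
# Hypothetical inputs, named so the result rows are labelled.
strings <- c(a = "AAGTCCCCA", b = "GGGGGTTA")

# For each string, collect every run (length >= 1) and keep the longest.
longest <- vapply(strings, function(s) {
  runs <- regmatches(s, gregexpr("(.)\\1*", s))[[1]]
  runs[which.max(nchar(runs))]
}, character(1))

# nchar on the result gives the counts.
result <- data.frame(run = longest, n = nchar(longest))
result
```

This assumes every input string is non-empty; an empty string has no runs and would make vapply fail.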