R- How would I identify a specific ordered sequence of characters that fits certain parameters?-CodePudding

Say I have the following sequence of letters which represent a sequence in a gene: 5’ CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG 3’

I want to identify a specific ordered sequence, length of 20 for instance, within the above sequence that follows the parameters:

Position 1: cannot be a G
Position 10: must be an A or T

I am only including two parameters above as a way to simplify the problem.

I start with the following, converting the sequence into a vector.

exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"

exon_2_vector <- as_vector(str_split_fixed(exon_2, pattern = "", n = nchar(exon_2)))

From here, I have been trying to start at position 1, and use if/else statements, and then if position 1 does not work, moving on to position 2, but I am struggling with the workflow and not the greatest at coding.

Any suggestions would be greatly appreciated.

CodePudding user response：

A solution using ngrams:

#install.packages('ngram')
library(ngram)
library(stringr)
library(dplyr)
library(purrr)
exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"
exon_2_ws <- exon_2  %>% str_replace_all('(\\w)', '\\1 ')
ngram = exon_2_ws %>% ngram(n = 20)  %>% get.ngrams %>% as_tibble %>%
  mutate(value = value %>% str_replace_all(' ', ''),
         identification = if_else( !substring(value,1,1) == 'G' & substring(value, 10,10) %in% c('A', 'T') , 1, 0))

Output:

# A tibble: 62 x 2
   value                identification
   <chr>                         <dbl>
 1 ATCGGCTGTGCGATTGTCGT              0
 2 CGGCTGTGCGATTGTCGTCA              0
 3 GCTATAACCGTGGCGACTCG              0
 4 CGGTGGCTATAACCGTGGCG              1
 5 GGCCATCGGCTGTGCGATTG              0
 6 TAACCGTGGCGACTCGATAA              0
 7 ATCGGTGGCTATAACCGTGG              1
 8 GGCTGTGCGATTGTCGTCAT              0
 9 CGATTGTCGTCATCGGTGGC              1
10 GATTGTCGTCATCGGTGGCT              0
# ... with 52 more rows

You could then filter valid sequences like this:

ngram %>% filter(identification == 1) %>% pull(value)

Which brings:

 [1] "CGGTGGCTATAACCGTGGCG" "ATCGGTGGCTATAACCGTGG" "CGATTGTCGTCATCGGTGGC" "TACTGGCCATCGGCTGTGCG" "CTCGATAAGAGTCCGCCCTC" "CTATAACCGTGGCGACTCGA"
 [7] "TTGTCGTCATCGGTGGCTAT" "CTGTGCGATTGTCGTCATCG" "ACCGTGGCGACTCGATAAGA" "CGACTCGATAAGAGTCCGCC" "CATCGGCTGTGCGATTGTCG" "CGATAAGAGTCCGCCCTCCC"
[13] "ATTGTCGTCATCGGTGGCTA" "TCGGTGGCTATAACCGTGGC" "CGTCATCGGTGGCTATAACC" "CGTGGCGACTCGATAAGAGT"