Say I have the following sequence of letters which represent a sequence in a gene: 5’ CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG 3’
I want to identify a specific ordered sequence, length of 20 for instance, within the above sequence that follows the parameters:
Position 1: cannot be a G
Position 10: must be an A or T
I am only including two parameters above as a way to simplify the problem.
I start with the following, converting the sequence into a vector.
exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"
exon_2_vector <- as_vector(str_split_fixed(exon_2, pattern = "", n = nchar(exon_2)))
From here, I have been trying to start at position 1, and use if/else statements, and then if position 1 does not work, moving on to position 2, but I am struggling with the workflow and not the greatest at coding.
Any suggestions would be greatly appreciated.
CodePudding user response:
A solution using ngrams:
#install.packages('ngram')
library(ngram)
library(stringr)
library(dplyr)
library(purrr)
exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"
exon_2_ws <- exon_2 %>% str_replace_all('(\\w)', '\\1 ')
ngram = exon_2_ws %>% ngram(n = 20) %>% get.ngrams %>% as_tibble %>%
mutate(value = value %>% str_replace_all(' ', ''),
identification = if_else( !substring(value,1,1) == 'G' & substring(value, 10,10) %in% c('A', 'T') , 1, 0))
Output:
# A tibble: 62 x 2
value identification
<chr> <dbl>
1 ATCGGCTGTGCGATTGTCGT 0
2 CGGCTGTGCGATTGTCGTCA 0
3 GCTATAACCGTGGCGACTCG 0
4 CGGTGGCTATAACCGTGGCG 1
5 GGCCATCGGCTGTGCGATTG 0
6 TAACCGTGGCGACTCGATAA 0
7 ATCGGTGGCTATAACCGTGG 1
8 GGCTGTGCGATTGTCGTCAT 0
9 CGATTGTCGTCATCGGTGGC 1
10 GATTGTCGTCATCGGTGGCT 0
# ... with 52 more rows
You could then filter valid sequences like this:
ngram %>% filter(identification == 1) %>% pull(value)
Which brings:
[1] "CGGTGGCTATAACCGTGGCG" "ATCGGTGGCTATAACCGTGG" "CGATTGTCGTCATCGGTGGC" "TACTGGCCATCGGCTGTGCG" "CTCGATAAGAGTCCGCCCTC" "CTATAACCGTGGCGACTCGA"
[7] "TTGTCGTCATCGGTGGCTAT" "CTGTGCGATTGTCGTCATCG" "ACCGTGGCGACTCGATAAGA" "CGACTCGATAAGAGTCCGCC" "CATCGGCTGTGCGATTGTCG" "CGATAAGAGTCCGCCCTCCC"
[13] "ATTGTCGTCATCGGTGGCTA" "TCGGTGGCTATAACCGTGGC" "CGTCATCGGTGGCTATAACC" "CGTGGCGACTCGATAAGAGT"