Extract strings based on multiple patterns-CodePudding

I have thousands of DNA sequences that look like this :).

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

I need to extract every sequence between the CTACG and CAGTC. However, many cases in these sequences come with an error (deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#> 
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#> 
#> [[3]]
#> [1] NA

reprex()
#> Error in reprex(): could not find function "reprex"

^{Created on 2021-12-18 by the reprex package (v2.0.1)}

Like this I would be able to extract the sequence also in the last case.

UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

CodePudding user response：

Using aregexec, build a regex pattern with sprintf, and finally removing the matches using gsub. Putting it into a Vectorized function to avoid overloading the script with lapplys or loops.

In the regex, the .* refers to everything before (resp. after) the respective letters. Note, that you probably need to adapt the max.distance= with your real data.

fu <- Vectorize(function(x) {
  p1 <- regmatches(x, aregexec('.*CTACG', x, max.distance=0.1))
  p2 <- regmatches(x, aregexec('CAGTC.*', x, max.distance=0.1))
  gsub(sprintf('%s|%s', p1, p2), '', x, perl=TRUE)
})

fu(ref)
# CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC 
#          "GTTATGTACGATTAAAGAAGATCGT"          "CGTTGATATTTTGCATGCTTACTCC" 
# CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC 
#          "CGTTGATATTTTGCATGCTTACTCC"

Data:

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")