Given two sequences (such as DNA sequences) of equal length, I want to be able to find the mutations between them - both type and index within the sequence. As an example, if I fed in the sequences AGGCTAC
and AGCCTTC
, I want to get out a list like G3C, A6T
.
I can get the number of differences just fine:
seqs <- Biostrings::readAAStringSet("PSE-1_Round20.fas")
nmut <- adist(seqs[1], seqs)
But I can't think of a more elegant way to get the positions than just looping, which seems very kludgy - and I'd like to take this as an opportunity to learn instead.
I'm working with the Biostrings
package in R, but there don't seem to be any special tools in that package for what I want to do any I think any solution that works for generic strings should also work for me. In fact, if there's a more elegant solution in python or bash scripting, I'd also accept that.
CodePudding user response:
There seem to be multiple packages that should do this. One is the findMutations
function in the adegenet
package.
As for the string comparison question, see this question. Here's a function that will work if the strings are the same length:
mutations <- function(str1, str2) {
str1vec <- unlist(strsplit(str1, ""))
str2vec <- unlist(strsplit(str2, ""))
iMut <- (1:length(str1vec))[str1vec != str2vec]
return(paste0(str1vec[iMut], iMut, str2vec[iMut]))
}
> mutations("AGGCTAC", "AGCCTTC")
[1] "G3C" "A6T"