Identifying mutations between two sequences-CodePudding

Given two sequences (such as DNA sequences) of equal length, I want to be able to find the mutations between them - both type and index within the sequence. As an example, if I fed in the sequences AGGCTAC and AGCCTTC, I want to get out a list like G3C, A6T.

I can get the number of differences just fine:

seqs <- Biostrings::readAAStringSet("PSE-1_Round20.fas")
nmut <- adist(seqs[1], seqs)

But I can't think of a more elegant way to get the positions than just looping, which seems very kludgy - and I'd like to take this as an opportunity to learn instead.

I'm working with the Biostrings package in R, but there don't seem to be any special tools in that package for what I want to do any I think any solution that works for generic strings should also work for me. In fact, if there's a more elegant solution in python or bash scripting, I'd also accept that.

CodePudding user response：

There seem to be multiple packages that should do this. One is the findMutations function in the adegenet package.

As for the string comparison question, see this question. Here's a function that will work if the strings are the same length:

mutations <- function(str1, str2) {
  str1vec <- unlist(strsplit(str1, ""))
  str2vec <- unlist(strsplit(str2, ""))
  iMut <- (1:length(str1vec))[str1vec != str2vec]
  return(paste0(str1vec[iMut], iMut, str2vec[iMut]))
}

> mutations("AGGCTAC", "AGCCTTC")
[1] "G3C" "A6T"