I am working with DNA sequence data in fasta files, and have to work only in R for this project. I do some manipulations using the seqinr package (selecting a subset of sequences, altering the fasta headers etc). For the next stage in the analysis I want to do a multiple sequence alignment, and have used the msa R package. I can get msa working if I import a fasta file, but I'm struggling to move directly within R from the seqinr list object to the Biostrings DNAStringSet object that I have used as input for msa.
Example data - Assume that fasta_file.fasta is a fasta file with contents as follows:
>seq1
ATATATAT
>seq2
CGCGCGCG
>seq3
ATATCGCG
>seq4
ATATATAT
The code I've used is:
# Load packages
library(tidyverse)
library(seqinr)
library(msa)
library(adegenet)
library(bios2mds)
library(ape)
library(ggtree)
# Import sequences (using seqinr)
sequences <- read.fasta("fasta_file.fasta")
# Define sequences for selection
seqs_select <- c("seq1", "seq2", "seq3")
# Select chosen sequences
seq_sub <- sequences[names(sequences) %in% seqs_select]
# Check the number of sequences left
length(sequences) # 4 original
length(seq_sub) # 3 after selection
# Run multiple sequence alignment using msa
seq_alignment <- msa(seq_sub, method="ClustalOmega") # Generates an error message because seq_sub is the wrong input
msa works if I import the fasta file straight away as a Biostrings DNAstringset object:
# Import fasta file directly as Biostrings object
seq_dnastring <- readDNAStringSet("fasta_file.fasta")
seq_alignment <- msa(seq_dnastring, method="ClustalOmega")
I could save the processed fasta file I make using seqinr and then re-load it using readDNAStringSet, but my question is whether there is a way to convert directly from seq_sub to something that msa can use as input to run an alignment with. ie. to turn seq_sub format into seq_dnastring format. Thanks for the help.
CodePudding user response:
Another option is to process your sequences in Biostrings, instead of seqinr. Nonetheless, I think this does the trick
library(Biostrings)
library(seqinr)
FUN = function(x)
paste(getSequence(x), collapse = "")
as(vapply(sequences, FUN, character(1)), "DNAStringSet")