I am in a situation where I have so many fastq files that I want to convert to fasta. Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output files is correctly a fasta file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line I found that a file header was put at the end of the previous file sequence. And it turns out like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
So that when I merge they are placed on top of each other correctly and not at the end of the sequence of the previous file.
CodePudding user response:
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
I would use GNU sed
following replace
cat $1 >> final.fasta
using
sed '$a\\n' $1 >> final.fasta
Explanation: meaning of expression for sed
is at last line ($
) append newline (\n
) - this action is undertaken before default one of printing. If you prefer GNU AWK
then you might same behavior following way
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test any of solution as you doesnot provide enough information to this. I assume above line is somewhere inside loop and $1
is always name of file existing in current working directory.