I'm trying to read in a fasta file. I want to remove/ignore the header/info lines that begin with ">" and store the following sequences into sperate strings. Below is the code I have to do that (partially reworked from https://rosettacode.org/wiki/FASTA_format#C , as what I had originally worked even less). They have a good example of what I want to do.
My problem is given this fasta file:
">sequence_1
MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHE
GAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFI
STSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIG
CKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCH
AACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKIN
EAFDLLRSGKSIRTVLTF
">sequence_2
MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEA
AGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGT
STFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCK
AAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSV
IVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
DLLHSGKSIRTVLTF
My output:
Sequence 1: MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHEGAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFISTSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIGCKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCHAACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKINEAFDLLRSGKSIRTVLTF
Sequence 2: MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSVIVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
The last line or so of Sequence 2 is cut off..... Any help/solutions?
void read_in_Protein(string Protein_filename)
{ // read in the sequences
fstream myfile;
myfile.open(Protein_filename, ios::in);
if (!myfile.is_open()) {
cerr << "Error can not open file" << endl;
exit(1);
}
string Protein_Sequences{};
string Protein_Seq_names{};
// string temp{};
string Prot_Seq1{};
string Prot_Seq2{};
string line{};
while (getline(myfile, line).good()) {
//std::cout << "Line input received (" << line.length() << "): " << line << std::endl;
if (line.empty() || line[0] == '>') { // Identifier marker
if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
//std::cout << "\tReseting to new sequence" << std::endl;
// cout << Protein_Sequences << endl;
Protein_Seq_names.clear();
Prot_Seq1 = Protein_Sequences;
}
if (!line.empty()) {
//std::cout << "\tSetting sequence start" << std::endl;
Protein_Seq_names = line.substr(1);
}
// std::cout << "\tClearing sequences..." << std::endl;
Protein_Sequences.clear();
}
else if (!Protein_Seq_names.empty()) {
line = line.substr(0, line.length() - 1);
if (line.find(' ') != string::npos) { // Invalid sequence--no spaces allowed
//std::cout << "\tSpace found, clearing buffers..." << std::endl;
Protein_Seq_names.clear();
Protein_Sequences.clear();
}
else {
//std::cout << "\tAppending line to protein sequence..." << std::endl;
Protein_Sequences = line;
}
}
//std::cout << "Protein_Sequences: " << Protein_Sequences << std::endl;
}
if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
// cout << Protein_Sequences << endl;
Prot_Seq2 = Protein_Sequences;
}
cout << "\nSequence 1: " << Prot_Seq1 << endl;
cout << Prot_Seq1.length();
cout << "\nSequence 2: " << Prot_Seq2 << endl;
cout << Prot_Seq2.length();
}
CodePudding user response:
Assuming your file doesn't end with a new line then the last call to std::getline
will set the eof
bit to indicate that it reached the end of the file before finding the line ending. As you are checking .good()
in your while loop the last line will be discarded. You should instead check !fail()
(or just the boolean value of the stream itself which is equivalent to !fail()
):
while (getline(myfile, line))
After reading the final line the next iteration of the loop will try to read whilst the stream is in the eof
state and immediately fail and break out of the loop.