Home > front end >  Reading in fasta file C
Reading in fasta file C

Time:09-29

I'm trying to read in a fasta file. I want to remove/ignore the header/info lines that begin with ">" and store the following sequences into sperate strings. Below is the code I have to do that (partially reworked from https://rosettacode.org/wiki/FASTA_format#C , as what I had originally worked even less). They have a good example of what I want to do.

My problem is given this fasta file:

">sequence_1  
MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHE
GAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFI
STSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIG
CKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCH
AACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKIN
EAFDLLRSGKSIRTVLTF

">sequence_2  
MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEA
AGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGT
STFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCK
AAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSV
IVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
DLLHSGKSIRTVLTF

My output:

Sequence 1: MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHEGAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFISTSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIGCKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCHAACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKINEAFDLLRSGKSIRTVLTF  

Sequence 2: MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSVIVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF

The last line or so of Sequence 2 is cut off..... Any help/solutions?

void read_in_Protein(string Protein_filename)
{ // read in the sequences
    fstream myfile;
    myfile.open(Protein_filename, ios::in);

    if (!myfile.is_open()) {
        cerr << "Error can not open file" << endl;
        exit(1);
    }

    string Protein_Sequences{};
    string Protein_Seq_names{};

    //        string temp{};

    string Prot_Seq1{};
    string Prot_Seq2{};

    string line{};

    while (getline(myfile, line).good()) {
        //std::cout << "Line input received (" << line.length() << "): " << line << std::endl;
        if (line.empty() || line[0] == '>') { // Identifier marker
            if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
                //std::cout << "\tReseting to new sequence" << std::endl;
                //                      cout << Protein_Sequences << endl;
                Protein_Seq_names.clear();
                Prot_Seq1 = Protein_Sequences;
            }
            if (!line.empty()) {
                //std::cout << "\tSetting sequence start" << std::endl;
                Protein_Seq_names = line.substr(1);
            }
            //          std::cout << "\tClearing sequences..." << std::endl;
            Protein_Sequences.clear();
        }
        else if (!Protein_Seq_names.empty()) {
            line = line.substr(0, line.length() - 1);
            if (line.find(' ') != string::npos) { // Invalid sequence--no spaces allowed
                //std::cout << "\tSpace found, clearing buffers..." << std::endl;
                Protein_Seq_names.clear();
                Protein_Sequences.clear();
            }
            else {
                //std::cout << "\tAppending line to protein sequence..." << std::endl;
                Protein_Sequences  = line;
            }
        }
        //std::cout << "Protein_Sequences: " << Protein_Sequences << std::endl;
    }

    if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
        //          cout << Protein_Sequences << endl;
        Prot_Seq2 = Protein_Sequences;
    }

    cout << "\nSequence 1: " << Prot_Seq1 << endl;
    cout << Prot_Seq1.length();
    cout << "\nSequence 2: " << Prot_Seq2 << endl;
    cout << Prot_Seq2.length();
}

CodePudding user response:

Assuming your file doesn't end with a new line then the last call to std::getline will set the eof bit to indicate that it reached the end of the file before finding the line ending. As you are checking .good() in your while loop the last line will be discarded. You should instead check !fail() (or just the boolean value of the stream itself which is equivalent to !fail()):

while (getline(myfile, line))

After reading the final line the next iteration of the loop will try to read whilst the stream is in the eof state and immediately fail and break out of the loop.

  • Related