Missing last character when reading FASTA file [c++]

Time:11-08

I'm writing a program that reads in a FASTA file for further processing. The format of a FASTA file is like this:

> This line with ">" is the header; we want to skip/ignore this line
The lines below the header have the sequence information we want
ATTGGTATGATTTACCCAATTTGGGGAAAAAATTCCCTCTCGATAGCTATCCTGATTTGCGG
ATTGGTATGATTTACCCAATTTGGGGAAAAAATTCCCTCTCGATAGCTATCCTGATTTGCGG
ATTGGTATGATTTACCCAATTTGGGGAAAAAATTCCCTCTCGATAGCTATCCTGATTTGCGG

Ideally, the program should read in the FASTA file, skip the header line, and append the sequence lines below it to a string. My code does this, except that at the very end it leaves off the last character. In the example above, everything would be added except the last G of the last line. Here is my code and an example file:

#include <cstdlib>   // exit
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void reading_in_RNA_file()
{
    string RNA_file = "sample_query.txt";

    ifstream fin;
    fin.open(RNA_file);
    if (!fin.is_open())
    {
        cerr << "Error did not open file" << endl;
        exit(1);
    }

    string line = "";
    string RNA_seq = "";
    string FASTA_heading = "";
    string sequence = "";

    while (getline(fin, line))
    {
        if (line.empty() || line[0] == '>')
        { // Identifier marker
            if (!FASTA_heading.empty())
            { // Store what we read from the previous entry
                FASTA_heading.clear();
                RNA_seq += sequence;
            }

            if (!line.empty())
            {
                FASTA_heading = line.substr(1);
            }
            sequence.clear();
        }
        else if (!FASTA_heading.empty())
        {
            line = line.substr(0, line.length() - 1);
            if (line.find(' ') != string::npos)
            { // Invalid sequence--no spaces allowed
                FASTA_heading.clear();
                sequence.clear();
            }
            else
            {
                sequence += line;
            }
        }
    }

    if (!FASTA_heading.empty())
    { // Store what we read from the last entry
        RNA_seq += sequence;
    }
    cout << RNA_seq << endl;
}

The sample_query.txt FASTA file:

> true positive test query
GTCTGAGAAAACAAGGCTAGAGATTCCAATATTAGAGACAACAGGGCTCTGGGAAGATTAAGGTTGAGTT
TTCTGGATCTGCAGAATAGAGTCACTGAGGACCAATTGCAAGATCAGAGGAGATGAAAGAACAAGTCAAG
GCATGCTTAGGAAAAGAGAATATCAGGGATAGGTTTTAGGCAAGAGTCACACTGAGGAAGGGCAGGTTCT
ACATACAGTTTATCTTGGTACTGCCAAGTACCATTTGGGTCAGGATTTTGTCATTTAGATCCATATTTTT
CCTATATTTTTATCTGGTTCTTCCATCAGTTACTGAGAGAGCACTATTAATTCACCAGCTATAATTTTGG
ATTGTCAATTTCCTGCTTTTGTCTGTTGTTTTTGATTCACATACTTTGAGGCTCTGTGTGTGTGTGTAAT

Does anyone know why I'm having this issue?

CodePudding user response:

The mistake seems to be here:

line = line.substr(0, line.length() -1);

Obviously this removes the last character from line (incidentally line.pop_back() is a simpler, more efficient way to do that).

I guess you are under the impression that getline leaves the newline character in the string it reads, and you are trying to remove that newline. But this is not true: getline does read the newline, but it does not include that newline in the string it returns.
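To make that behaviour concrete, here is a minimal sketch. The helper name line_lengths is hypothetical and only exists for this illustration; it reads from an in-memory string instead of a file and records the length of every line that std::getline returns.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the length of each line as seen by std::getline.
std::vector<std::size_t> line_lengths(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::size_t> lengths;
    std::string line;
    while (std::getline(in, line))
    {
        // getline consumed the '\n' delimiter but did not store it in line,
        // so size() counts only the visible characters.
        lengths.push_back(line.size());
    }
    return lengths;
}
```

For the input "ACGT\nTTGG\n" this should report two lines of length 4, so nothing needs to be trimmed to get rid of the newline.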

CodePudding user response:

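It seems that you lose the last character of each line. It may be related to the statement below:

line = line.substr(0, line.length() - 1);

Change it to:

line = line.substr(0, line.length());

Note that this version simply copies line unchanged, so the statement can also be removed entirely.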

Please refer to the reference page below:

https://cplusplus.com/reference/string/string/substr/

Please let me know if it isn’t the case.
