Fastest way to read / write text file, excluding specific string


I am writing a program which reads large (10 GB+) text files, structured in chunks like this:

@Some_header
ATCCTTTATTCGGTATCGGATATATTACGCGCGGGGGATATCGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:::::::::
@Some_header unfixable_error
ATTTATTTAGAGGAGACTTTTATTTACCCCCCCCGGGGGGATTTTA
+
FFFFFFF:::::::::::::::FFFFFFFFFFUUUUUUUFFUUFUU
@Some_header
ATTATTCCCCTTTTTATACCGGGGGGAAATTAGGGGGGGCCCCTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

A chunk consists of the @header, the ATCG sequence, a '+' separator line, and then another string with the same length as the ATCG sequence. Some @header lines have 'unfixable_error' just before the newline. My program must read through these files and write all chunks, except for those with a @header unfixable_error, to a new file.

Currently, my approach is to utilize 'getline()', like so:

  std::ifstream inFile(inFileStr);
  std::ofstream outFile(outFileStr);

  std::string currLine;    
  
  while (getline(inFile, currLine)) {
    if (currLine == "+" || currLine.substr(currLine.length()-5, 5) != "error") {
      outFile << currLine << std::endl;
    }   
    else {
      for (int i = 0; i < 3; i++) {
        getline(inFile, currLine);
      }   
    }   
  }
  inFile.close();
  outFile.close();

I'm certain there's a better solution to this, however. What is the fastest feasible way to accomplish this?

CodePudding user response:

Here are a few points:

  • substr creates a new string, which is quite expensive for a simple comparison. Since C++17 you can use std::string_view to avoid creating new strings. An alternative is compare with a position and size. Since C++20, there is also ends_with, which is simpler here (see the first sketch after this list).
  • std::endl flushes the output stream, which is inefficient. Consider just writing '\n' instead.
  • getline tends to be a bit slow in practice. You can read big blocks and parse them yourself while avoiding copies as much as possible; writing in blocks is more efficient too. The blocks should not be too big, so that they fit in the CPU caches (RAM is slow compared to caches). For example, skipping lines with getline is not efficient since it copies data in memory. With blocks, you can directly search for the next three '\n' characters without writing anything. That search can easily be vectorized with SIMD instructions, so it can be very fast (compilers should be able to do that for you). See the second sketch after this list.
  • Pre-reserving some space for currLine might result in a small speed up.
  • You could try to parallelize the algorithm, but it is probably not worth it: the processing should be I/O bound (unless the files are cached or you use a high-performance NVMe SSD), and parallelizing it is not easy.
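
Here is a minimal sketch of the getline-based loop with those fixes applied. The function name filterChunks is illustrative, not from the question, and it assumes C++20 for ends_with:

  #include <fstream>
  #include <string>

  void filterChunks(const std::string& inFileStr, const std::string& outFileStr) {
    std::ifstream inFile(inFileStr);
    std::ofstream outFile(outFileStr);

    std::string currLine;
    currLine.reserve(256);  // pre-reserving limits reallocations

    while (std::getline(inFile, currLine)) {
      // ends_with (C++20) avoids the temporary string created by substr;
      // with C++17 you could use std::string_view or compare() instead.
      if (currLine == "+" || !currLine.ends_with("error")) {
        outFile << currLine << '\n';  // '\n' does not flush, unlike std::endl
      } else {
        // skip the remaining three lines of the rejected chunk
        for (int i = 0; i < 3; ++i) {
          std::getline(inFile, currLine);
        }
      }
    }
  }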
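For the block-based approach, here is a rough sketch under some assumptions: every chunk is exactly four '\n'-terminated lines, the file ends with a newline, and the block size and the helper name processBlocks are illustrative choices, not from the question. It scans each block for newlines with memchr and only copies the chunks that are kept:

  #include <cstddef>
  #include <cstring>
  #include <fstream>
  #include <string>
  #include <string_view>
  #include <vector>

  void processBlocks(const std::string& inFileStr, const std::string& outFileStr) {
    std::ifstream inFile(inFileStr, std::ios::binary);
    std::ofstream outFile(outFileStr, std::ios::binary);

    constexpr std::size_t blockSize = 1 << 20;  // 1 MiB read block (illustrative)
    std::vector<char> block(blockSize);
    std::string carry;                          // holds an incomplete chunk between reads
    std::string out;
    out.reserve(blockSize);

    while (inFile) {
      inFile.read(block.data(), static_cast<std::streamsize>(block.size()));
      carry.append(block.data(), static_cast<std::size_t>(inFile.gcount()));

      const char* end = carry.data() + carry.size();
      const char* chunkStart = carry.data();

      while (true) {
        // locate the four line breaks of one chunk without copying lines
        const char* nl[4];
        const char* q = chunkStart;
        bool complete = true;
        for (int i = 0; i < 4; ++i) {
          const char* n = static_cast<const char*>(
              std::memchr(q, '\n', static_cast<std::size_t>(end - q)));
          if (!n) { complete = false; break; }
          nl[i] = n;
          q = n + 1;
        }
        if (!complete) break;  // rest of the chunk arrives with the next block

        // keep the chunk unless its header line ends with "error"
        std::string_view header(chunkStart, static_cast<std::size_t>(nl[0] - chunkStart));
        if (!header.ends_with("error")) {  // string_view::ends_with is C++20
          out.append(chunkStart, static_cast<std::size_t>(nl[3] + 1 - chunkStart));
        }
        chunkStart = nl[3] + 1;
      }

      outFile.write(out.data(), static_cast<std::streamsize>(out.size()));
      out.clear();
      carry.erase(0, static_cast<std::size_t>(chunkStart - carry.data()));
    }
  }

Reading and writing in binary mode and erasing only the consumed prefix of carry keeps the per-chunk work to a few memchr scans and at most one copy into the output buffer.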