Optimizing Calculation to Get Length of Longest Sequence of Zero Bytes in Binary File


I need to calculate the length of the longest sequence of zero bytes in a binary file as fast as possible. I have a basic implementation in C++ below:

#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>

int get_max_zero_streak(std::string fname)
{
  std::ifstream myfile(fname, std::ios_base::binary);
  int length = 0;
  int streak = 0;
  while(myfile) 
  {
    int x = myfile.get(); // int, so EOF (-1) is distinguishable from a zero byte
    if(x == 0)
    {
      streak += 1;
    }
    else
    {
      length = std::max(length, streak);
      streak = 0;
    }
  }
  return length;
}

int main() 
{
  std::cout << get_max_zero_streak("000_c.aep") << std::endl;
  std::cout << get_max_zero_streak("000_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("000_g2.aep") << std::endl;
  std::cout << get_max_zero_streak("001_c.aep") << std::endl;
  std::cout << get_max_zero_streak("001_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("001_g2.aep") << std::endl;
  std::cout << get_max_zero_streak("002_c.aep") << std::endl;
  std::cout << get_max_zero_streak("002_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("002_g2.aep") << std::endl;
  return 0;
}

This is OK on smaller files, but is extraordinarily slow on larger files (such as 50 GB). Is there a more efficient way to write this, or is parallelizing it my only hope? I'm reading the files from an NVMe SSD, so I don't think the drive read speed is the limitation.

CodePudding user response:

The main bottleneck of this code is the byte-by-byte I/O, which prevents the loop from saturating most SSDs (by a large margin). Indeed, calling myfile.get() for every byte is slow, and compilers do not optimize it away in practice. The fix is to read the data chunk by chunk and use an inner loop operating on the current chunk. Chunks must be large enough to amortize the cost of the I/O calls, yet small enough that the data stays in cache (a buffer of 32 KiB–256 KiB should be good enough).
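A minimal sketch of this chunked variant (the function name and the exact 64 KiB buffer size are my own choices; anything in the 32 KiB–256 KiB range suggested above should behave similarly):

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Same streak logic as the question, but the file is read through a
// 64 KiB buffer instead of one byte at a time.
int get_max_zero_streak_chunked(const std::string& fname)
{
    std::ifstream myfile(fname, std::ios_base::binary);
    std::vector<char> buf(64 * 1024);
    int length = 0;
    int streak = 0;
    while (myfile)
    {
        myfile.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = myfile.gcount();  // bytes actually read
        for (std::streamsize i = 0; i < got; ++i)
        {
            if (buf[i] == 0)
                streak += 1;
            else
            {
                length = std::max(length, streak);
                streak = 0;
            }
        }
    }
    return std::max(length, streak);  // count a streak that runs to EOF
}
```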

Another big bottleneck is the computing loop itself: conditions are slow on modern processors unless they can be predicted (e.g. if most values are 0 or 1 and there are not many unpredictable switches between the two). Branchless code should help make the loop significantly faster, but the loop-carried dependency then becomes the main issue. Clang is able to generate a branchless assembly loop, but GCC does not manage this optimization. Still, the loop stays serial and scalar, so the processor cannot process bytes faster than its own frequency; in fact, it is much slower than that due to instruction latency and the dependencies. An efficient fix is to operate on multiple bytes at a time. One option is x86 SIMD instructions (x86-specific and fast). Another is to load 8 bytes and operate on the large integer with bit tricks, but that is tricky to do efficiently and I doubt it would be much faster in the end.
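The branchless form looks like this (a sketch; the function name is illustrative). Both updates are unconditional selects, which compilers typically lower to conditional moves rather than branches:

```cpp
#include <algorithm>
#include <cstddef>

// Branchless scan over an in-memory buffer: extend-or-reset the streak
// with a select, then fold it into the maximum on every iteration.
int max_zero_streak_branchless(const unsigned char* data, std::size_t n)
{
    int length = 0;
    int streak = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        streak = (data[i] == 0) ? streak + 1 : 0;  // select, not a branch
        length = std::max(length, streak);
    }
    return length;
}
```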

Regarding the SIMD solution, the idea is to load data in blocks of 16 bytes, compare all 16 bytes to 0 simultaneously, and set one bit of an integer mask to 1 or 0 according to the result of each comparison. You can then count the trailing set bits of the mask very quickly (with generic code, or with compiler built-ins like __builtin_ctz that map to a fast x86 instruction), and shift the mask by the number of trailing bits found. This approach is not very fast when there are many 0/non-0 switches, but it is very fast when there are large blocks of zeros (or of non-zeros). Note that with AVX2 it is even possible to process 32 bytes at a time. The SSE2 intrinsics needed are _mm_load_si128, _mm_cmpeq_epi8 and _mm_movemask_epi8.
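A sketch of this, under two assumptions: the CPU supports SSE2 (always true on x86-64) and the compiler provides GCC/Clang-style __builtin_ctz/__builtin_clz. It uses _mm_loadu_si128 instead of _mm_load_si128 so the buffer does not need 16-byte alignment, and a short bit loop (rather than the shift-and-repeat scheme described above) to find runs inside a mask:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

#include <algorithm>
#include <cstddef>

// Each 16-byte block becomes a 16-bit mask (bit i set iff byte i is 0);
// zero runs are then tracked on the masks across block boundaries.
int max_zero_streak_sse2(const unsigned char* data, std::size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    int length = 0, streak = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)
    {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        const unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask == 0xFFFFu) { streak += 16; continue; }  // whole block is zero
        // Zero bytes at the bottom of the block extend the previous run.
        length = std::max(length, streak + __builtin_ctz(~mask));
        // Longest run of set bits fully inside this mask.
        unsigned m = mask;
        int inner = 0;
        while (m) { m &= m << 1; ++inner; }
        length = std::max(length, inner);
        // Zero bytes at the top of the block start the next run.
        streak = __builtin_clz(~(mask << 16));
    }
    for (; i < n; ++i)  // scalar tail for the last n % 16 bytes
    {
        streak = (data[i] == 0) ? streak + 1 : 0;
        length = std::max(length, streak);
    }
    return std::max(length, streak);
}
```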

Note that you can use a lookup table (LUT) to speed things up when there are many switches. This is not so simple, though, because the runs of zeros in two consecutive masks may need to be added together. The LUT can also encode the number of pending zeros carried over from the previous mask, so as to be even faster: the idea is to find the maximum number of consecutive set bits in an 8-bit mask (representing 8 bytes), given the number of zero bytes already pending.
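A sketch of the within-mask part of this idea: a 256-entry table, built once at startup, giving the longest run of set bits in each 8-bit mask, so scanning a mask costs one lookup instead of a bit loop. Carrying the pending-zero count through the table, as suggested above, would need a larger LUT indexed by (mask, pending) and is not shown here:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Build a table mapping every 8-bit mask to the length of its longest
// run of consecutive set bits (i.e. consecutive zero bytes).
std::array<std::uint8_t, 256> make_run_lut()
{
    std::array<std::uint8_t, 256> lut{};
    for (int m = 0; m < 256; ++m)
    {
        int best = 0, cur = 0;
        for (int b = 0; b < 8; ++b)
        {
            cur = ((m >> b) & 1) ? cur + 1 : 0;  // run of set bits ending at bit b
            best = std::max(best, cur);
        }
        lut[m] = static_cast<std::uint8_t>(best);
    }
    return lut;
}
```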

Another possible optimization is to process the chunks in parallel. The idea is to reduce each chunk to three values: the number of leading zero bytes, the number of trailing (pending) zero bytes, and the maximum run of consecutive zero bytes found between the two. The special case where all the bytes of a chunk are zero must be handled. These values are stored temporarily, and a final pass at the end of the file merges the results of all the chunks. This can be implemented quite simply with OpenMP tasks (one per chunk).
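A sketch of the per-chunk summary and the final merge pass (the names ChunkSummary, summarize and merge_chunks are illustrative, not from the post; the OpenMP task creation is omitted, since each summarize() call is independent and the summaries are merged sequentially afterwards):

```cpp
#include <algorithm>
#include <vector>

struct ChunkSummary
{
    long long prefix;  // zero bytes at the start of the chunk
    long long suffix;  // zero bytes at the end of the chunk
    long long best;    // longest zero run strictly inside the chunk
    long long size;    // chunk length (prefix == size means all-zero)
};

ChunkSummary summarize(const unsigned char* d, long long n)
{
    ChunkSummary s{0, 0, 0, n};
    long long i = 0;
    while (i < n && d[i] == 0) ++i;
    s.prefix = i;
    if (i == n) { s.suffix = n; s.best = n; return s; }  // all-zero chunk
    long long j = n;
    while (d[j - 1] == 0) --j;  // d[i] != 0 guarantees termination
    s.suffix = n - j;
    long long streak = 0;
    for (long long k = i; k < j; ++k)
    {
        streak = (d[k] == 0) ? streak + 1 : 0;
        s.best = std::max(s.best, streak);
    }
    return s;
}

long long merge_chunks(const std::vector<ChunkSummary>& parts)
{
    long long best = 0, carry = 0;  // carry = pending zeros at the boundary
    for (const ChunkSummary& c : parts)
    {
        if (c.prefix == c.size) { carry += c.size; continue; }  // all-zero chunk
        best = std::max({best, carry + c.prefix, c.best});
        carry = c.suffix;
    }
    return std::max(best, carry);
}
```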

CodePudding user response:

As a starting point, I'd read a large chunk of data at once (or, if you're on Linux, memory-map the file [1]).

Then note that we don't care about the length of every run of zero bytes. We only care about finding a run longer than the longest we've found yet. We can use that to improve speed considerably.

For example, let's assume that starting from byte 0, we find a run of 26 zero bytes (and byte 26, the 27th byte, is non-zero).

So the next possible starting point is byte 27. But we don't start counting the length of a run from there. At this point, we only care about whether there is a run of at least 27 zeros starting at byte 27. For that to be the case, all the bytes from 27 through 53 have to be zero. So we can jump straight to byte 53 and see whether it's zero. Then if (and only if) it is, we step backward through the preceding bytes to find the beginning of this run of zeros, compute the location of the earliest end that would give a new longest run starting from that point, and repeat.

But in the highly likely case (a 255/256 chance, if the data is random) that byte 53 is non-zero, we can skip ahead another 27 bytes (to byte 80) and see whether that one is zero. Again, it most likely isn't, in which case we skip ahead another 27 bytes.

As the current maximum length grows, the efficiency of this algorithm grows with it. Most of the time we're looking at only one out of every length bytes of the data. If length gets very large, there may easily be entire sectors we never need to read in from secondary storage at all.
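The scheme above can be sketched on an in-memory buffer like this (the function name is illustrative). We only probe the byte that would have to be zero for a run longer than the current record to exist; on a hit, we walk backward to the run's start and forward to its end:

```cpp
#include <algorithm>
#include <cstddef>

std::size_t max_zero_streak_skip(const unsigned char* data, std::size_t n)
{
    std::size_t best = 0;
    std::size_t i = 0;  // probe position
    while (i < n)
    {
        if (data[i] != 0)
        {
            i += best + 1;  // no run of best+1 zeros can contain byte i
            continue;
        }
        std::size_t start = i;  // walk back to the start of this run
        while (start > 0 && data[start - 1] == 0) --start;
        std::size_t end = i + 1;  // walk forward to the end of this run
        while (end < n && data[end] == 0) ++end;
        best = std::max(best, end - start);
        i = end + best + 1;  // earliest probe for a run of best+1 zeros
    }
    return best;
}
```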


  1. In theory you can memory map the file on Windows as well, but at least in my experience, memory mapping tends to be a big win on Linux, but rarely gives much improvement on Windows. If you're using Windows, you're generally better off calling CreateFile directly, and passing FILE_FLAG_NO_BUFFERING. See Optimizing an I/O bound Win32 application for more details.