Is it possible to efficiently get a subset of rows from a large fixed-width CSV file?


I have an extremely large fixed-width CSV file (1.3 million rows and 80K columns), about 230 GB in size. I need to be able to fetch a subset of those rows, and I have a vector of the row indices I need. What I now have to figure out is how to traverse such a massive file to get them.

The way I understand it, C will go through the file line by line until it hits the newline (or a given delimiter), at which point it clears the buffer and moves on to the next line. I have also heard of a seek() function that can go to a given position in a stream. So is it possible to use this function somehow to move the file pointer to the correct line quickly?

I figured that if I simply tell the program where to go in the fixed-width file, it won't have to run billions of if statements checking for newlines, which might improve the speed. But I have no idea how to do that.

Let's say that my file has a width of n characters and my line numbers are {l_1, l_2, ..., l_m} (where l_1 < l_2 < ... < l_m). In that case, I can simply tell the file pointer to go to (l_1 - 1) * n, right? But then for the next line, do I calculate the next jump from the end of the l_1 line or from the beginning of the next line? And should I include the newlines when calculating the jumps?
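
To make it concrete, here is roughly what I'm imagining (the record length below is just a placeholder for my line width plus the newline):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream src("data.txt");

    // Placeholder: the fixed width of one line in bytes, newline included.
    const long long record_len = 240'010;

    // Jump straight to line l (1-based) without scanning the lines before it.
    const long long l = 1'000'000;
    src.seekg((l - 1) * record_len);

    std::string line;
    std::getline(src, line);
    std::cout << line.substr(0, 30) << '\n';   // first few fields of that line
    return 0;
}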

Will this even help improve speed, or am I just misunderstanding something here?

Thanks for taking the time to help

EDIT: The file looks like this:

id0000001,AB,AB,AA,--,BB
id0000002,AA,--,AB,--,BB
id0000003,AA,AA,--,--,BB
id0000004,AB,AB,AA,AB,BB

CodePudding user response:

As I proposed in the comment, you can compress each data field to two bits:

-- 00
AA 01
AB 10
BB 11

That cuts your file size by a factor of 12, so it'll be ~20 GB. Considering that your processing is likely IO-bound, you may speed it up by the same factor of 12.

The resulting file will have a fixed record length of 20,000 bytes (80,000 fields × 2 bits), so it will be easy to calculate the offset of any given record. No newline symbols to consider :)

Here is how I build that binary file:

#include <chrono>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();

    std::ifstream src("data.txt");
    std::ofstream bin("data.bin", std::ios::binary | std::ios::out);

    // One record: the 9-character id, the comma after it, and 80,000 3-byte
    // fields ("AB," and so on; the last field ends with '\n' instead of ',').
    size_t length = 80'000 * 3 + 9 + 1;
    std::string str(length, '\0');

    while (src.read(&str[0], length))
    {
        size_t pos = str.find(',') + 1;              // skip the id field
        for (int group = 0; group < 2500; ++group)   // 2,500 groups * 32 fields = 80,000 fields
        {
            uint64_t compressed(0), field(0);
            for (int i = 0; i < 32; ++i, pos += 3)   // pack 32 fields into one 64-bit word
            {
                if (str[pos] == '-')
                    field = 0;                       // "--"
                else if (str[pos] == 'B')
                    field = 3;                       // "BB"
                else if (str[pos + 1] == 'B')
                    field = 2;                       // "AB"
                else
                    field = 1;                       // "AA"

                compressed <<= 2;
                compressed |= field;
            }
            bin.write(reinterpret_cast<char*>(&compressed), sizeof compressed);
        }
    }

    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
    return 0;
}

This can encode about 1,600 records per second, so the whole file will take ~15 minutes. How long does it take you now to process it?
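
And here is a rough sketch of how a single record could then be pulled back out of data.bin (the read_record helper name is just for illustration; it assumes the same machine byte order and the bit layout produced by the encoder above):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// Expand record `row` (0-based) back into one 2-bit code per field
// (00 = "--", 01 = "AA", 10 = "AB", 11 = "BB").
std::vector<uint8_t> read_record(std::ifstream& bin, size_t row)
{
    constexpr size_t fields_per_record = 80'000;
    constexpr size_t record_bytes = fields_per_record / 4;    // 20,000 bytes per record

    std::vector<uint64_t> packed(record_bytes / sizeof(uint64_t));
    bin.seekg(row * record_bytes);                             // jump straight to the record
    bin.read(reinterpret_cast<char*>(packed.data()), record_bytes);

    std::vector<uint8_t> fields;
    fields.reserve(fields_per_record);
    for (uint64_t word : packed)
        for (int i = 31; i >= 0; --i)                          // the first field sits in the top bits
            fields.push_back((word >> (2 * i)) & 0x3);
    return fields;
}

int main()
{
    std::ifstream bin("data.bin", std::ios::binary);
    auto fields = read_record(bin, 42);                        // e.g. row index 42
    std::cout << int(fields.front()) << '\n';
    return 0;
}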

CodePudding user response:

The seek family of functions in the <iostream> classes is generally byte-oriented. You can use them if, and only if, you're absolutely confident that your records (lines, in this case) have a fixed count of bytes. In that case, instead of getline, you can open the file in binary mode and use .read, which reads the specified number of bytes into a byte array of sufficient capacity (see the sketch below). But because the file is storing text after all, if even a single record has a different size, you'll get out of alignment; if the id field is guaranteed to equal the line number, or at least to be an increasing mapping of it, an educated guess and a follow-up trial & error can help you resynchronize.

You also need to switch to some better database management fast; even a 10 GB single binary file is too large and prone to corruption. You may consider chopping it into much smaller slices (on the order of 100 MB, maybe) to minimize the chance of damage propagating, plus you'll need some redundancy mechanism for recovery/correction.
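
Here is a minimal sketch of that seek-and-read approach, assuming every line of the text file really is the same fixed number of bytes (the record_size value is a placeholder):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Placeholder: bytes per line, trailing '\n' included.
    constexpr long long record_size = 240'010;

    std::ifstream src("data.txt", std::ios::binary);    // binary: no newline translation
    std::vector<long long> rows = {7, 1024, 999'999};   // 1-based row indices, sorted ascending

    std::string buf(record_size, '\0');
    for (long long row : rows)
    {
        src.seekg((row - 1) * record_size);             // absolute offset, so each jump is independent
        if (!src.read(&buf[0], record_size))
        {
            std::cerr << "short read at row " << row << '\n';
            break;
        }
        std::cout << buf.substr(0, 30) << "...\n";      // first few fields of the record
    }
    return 0;
}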
