Handle very large data in C++


I have a .csv file with ~3 GB of data. I want to read all of that data and process it. The following program reads the data from the file and stores it in a std::vector<std::vector<std::string>>. However, the program runs for too long and the application (VS Code) freezes and needs to be restarted. What have I done wrong?

#include <algorithm>
#include <iostream>
#include <fstream>
#include "sheet.hpp"

extern std::vector<std::string> split(const std::string& str, const std::string& delim);

int main() {
    Sheet sheet;
    std::ifstream inputFile;
    inputFile.open("C:/Users/1032359/cpp-projects/Straggler Job Analyzer/src/part-00001-of-00500.csv");
    std::string line;
    while(inputFile >> line) {
        sheet.addRow(split(line, ","));
    }
    return 0;
}

// split and Sheet's member functions have been tested thoroughly and work fine. split has a complexity of N^2 though...


EDIT1: The file read has been fixed as per the suggestions in the comments.
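
(The comments themselves are not shown on this page; presumably the fix was reading whole lines with std::getline instead of operator>>, roughly like this sketch:)

std::string line;
while (std::getline(inputFile, line)) {
    sheet.addRow(split(line, ","));
}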

The Split function:

std::vector<std::string> split(const std::string& str, const std::string& delim) {
    std::vector<std::string> vec_of_tokens;
    std::string token;
    for (auto character : str) {
        if (std::find(delim.begin(), delim.end(), character) != delim.end()) {
            vec_of_tokens.push_back(token);
            token = "";
            continue;
        }
        token += character;
    }
    vec_of_tokens.push_back(token);
    return vec_of_tokens;
}

EDIT2: dummy csv row: 5612000000,5700000000,4665712499,798,3349189123,0.02698,0.06714,0.07715,0.004219,0.004868,0.06726,7.915e-05,0.0003681,0.27,0.00293,3.285,0.008261,0,0,0.01608

limits:  
field1: starting timestamp (nanosecs)
field2: ending timestamp (nanosecs)
field3: job id (<= 1,000,000)
field4: task id (<= 10,000)
field5: machine id (<= 30,000,000)
field6: CPU time (sorry, no clue)
field7-20: no idea, unused for the current stage, but needed for later stages.

EDIT3: Required Output

Remember the .ThenBy function in Excel?
The sorting order here is: sort first on the 5th column (1-based indexing), then on the 3rd column, and lastly on the 4th column; all ascending.

CodePudding user response:

I would start by defining a class to carry the information about one record and add overloads for operator>> and operator<< to help reading/writing records from/to streams. I'd probably add a helper to deal with the comma delimiter too.

First, the set of headers I've used:

#include <algorithm>   // sort
#include <array>       // array
#include <cstdint>     // integer types
#include <filesystem>  // filesystem
#include <fstream>     // ifstream
#include <iostream>    // cout
#include <iterator>    // istream_iterator
#include <tuple>       // tie
#include <vector>      // vector

A simple delimiter helper could look like the one below. It discards (ignore()) the delimiter if it is next in the stream, or sets the failbit on the stream if the delimiter is not there.

template <char Char> struct delimiter {};

template <char Char> // read a delimiter
std::istream& operator>>(std::istream& is, const delimiter<Char>) {
    if (is.peek() == Char) is.ignore();
    else is.setstate(std::ios::failbit);
    return is;
}

template <char Char> // write a delimiter
std::ostream& operator<<(std::ostream& os, const delimiter<Char>) {
    return os.put(Char);
}
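
As a quick usage sketch of my own (not part of the answer's code; it needs <sstream> for the string stream), extracting two comma-separated integers would look like this:

std::istringstream iss("12,34");   // #include <sstream>
int a = 0, b = 0;
iss >> a >> delimiter<','>{} >> b; // the stream fails if the ',' is missing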

The actual record class can, with the information you've supplied, look like this:

struct record {
    uint64_t start;       // ns
    uint64_t end;         // ns
    uint32_t job_id;      // [0,1000000]
    uint16_t task_id;     // [0,10000]
    uint32_t machine_id;  // [0,30000000]
    double cpu_time;
    std::array<double, 20 - 6> unknown;
};

Reading such a record from a stream can then be done like this, using the delimiter class template (instantiated to use a comma and newline as delimiters):

std::istream& operator>>(std::istream& is, record& r) {
    delimiter<','> del;
    delimiter<'\n'> nl;

    // first read the named fields
    if (is >> r.start >> del >> r.end >> del >> r.job_id >> del >>
        r.task_id >> del >> r.machine_id >> del >> r.cpu_time)
    {
        // then read the unnamed fields:
        for (auto& unk : r.unknown) is >> del >> unk;
    }
    return is >> nl;
}

Writing a record is similarly done by:

std::ostream& operator<<(std::ostream& os, const record& r) {
    delimiter<','> del;
    delimiter<'\n'> nl;

    os << 
        r.start << del <<
        r.end << del <<
        r.job_id << del <<
        r.task_id << del <<
        r.machine_id << del <<
        r.cpu_time;
    for(auto&& unk : r.unknown) os << del << unk;
    return os << nl;
}

Reading the whole file into memory, sorting it and then printing the result:

int main() {
    std::filesystem::path filename = "C:/Users/1032359/cpp-projects/"
                                     "Straggler Job Analyzer/src/part-00001-of-00500.csv";
    
    std::vector<record> records;

    // Reserve space for "3GB" / 158 (the length of a record + some extra bytes)
    // records. Increase the 160 below if your records are actually longer on average:
    records.reserve(std::filesystem::file_size(filename) / 160);

    // open the file
    std::ifstream inputFile(filename);

    // copy everything from the file into `records`
    std::copy(std::istream_iterator<record>(inputFile),
              std::istream_iterator<record>{},
              std::back_inserter(records));

    // sort on columns 5-3-4 (ascending)
    auto sorter = [](const record& lhs, const record& rhs) {
        return std::tie(lhs.machine_id, lhs.job_id, lhs.task_id) <
               std::tie(rhs.machine_id, rhs.job_id, rhs.task_id);
    };
    std::sort(records.begin(), records.end(), sorter);

    // display the result
    for(auto& r : records) std::cout << r;
}

The above process takes ~2 minutes on my old computer with spinning disks. If this is too slow, I'd measure the time of the long-running parts (a small timing sketch follows the list below):

  • reserve
  • copy
  • sort
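
A minimal way to time those phases, assuming the main() shown above (phase_timer is just a helper name I made up for this sketch, not something from the answer):

#include <chrono>    // steady_clock, duration_cast
#include <iostream>  // cerr

// crude phase timer: prints how long the enclosing scope took
struct phase_timer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit phase_timer(const char* n) : name(n), start(std::chrono::steady_clock::now()) {}
    ~phase_timer() {
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cerr << name << ": " << ms << " ms\n";
    }
};

// usage inside main(), wrapping each phase in its own scope:
//   { phase_timer t("reserve"); records.reserve(...); }
//   { phase_timer t("copy");    std::copy(...); }
//   { phase_timer t("sort");    std::sort(records.begin(), records.end(), sorter); }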

Then you can probably use that information to figure out where you need to improve. For example, if sorting is a bit slow, it could help to use a std::vector<double> instead of a std::array<double, 20-6> to store the unnamed fields, since std::sort then only has to move a pointer instead of copying 14 doubles whenever it rearranges a record:

struct record {
    record() : unknown(20-6) {} 

    uint64_t start;       // ns
    uint64_t end;         // ns
    uint32_t job_id;      // [0,1000000]
    uint16_t task_id;     // [0,10000]
    uint32_t machine_id;  // [0,30000000]
    double cpu_time;
    std::vector<double> unknown;
};

CodePudding user response:

As an alternative way to solve this problem, I would suggest not reading all the data into memory, but using the minimum amount of RAM needed to sort the huge CSV file: a std::vector of line offsets.

The following code is not yet tested. Feel free to fix it where needed. The important thing is to understand the concept, not the precise implementation.

As the implementation only needs 8 bytes per line (in 64-bit mode), sorting the 3 GB data file only needs roughly 150 MB of RAM. The drawback is that the numbers in a given line need to be parsed several times, roughly log2(17e6) ≈ 24 times. However, I think this overhead is partially compensated by the lower memory use and by not having to parse every number of a row.

#include <Windows.h>
#include <cstdint>
#include <cstdlib>   // atoll
#include <cstring>   // strchr
#include <vector>
#include <algorithm>
#include <array>
#include <fstream>

std::array<uint64_t, 5> readFirst5Numbers(const char* line)
{
    std::array<uint64_t, 5> nbr;
    for (int i = 0; i < 5; i++)
    {
        nbr[i] = atoll(line);
        line = strchr(line, ',') + 1;
    }
    return nbr;
}

int main()
{
    // 1. Map the input file in memory
    const char* inputPath = "C:/Users/1032359/cpp-projects/Straggler Job Analyzer/src/part-00001-of-00500.csv";
    HANDLE fileHandle = CreateFileA(inputPath, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
    DWORD highsize;
    DWORD lowsize = GetFileSize(fileHandle, &highsize);
    HANDLE mappingHandle = CreateFileMapping(fileHandle, NULL, PAGE_READONLY, highsize, lowsize, NULL);
    size_t fileSize = (size_t)lowsize | (size_t)highsize << 32;
    const char* memoryAddr = (const char*)MapViewOfFile(mappingHandle, FILE_MAP_READ, 0, 0, fileSize);

    // 2. Find the offset of the start of lines
    std::vector<size_t> linesOffset;
    linesOffset.push_back(0);
    for (size_t i = 0; i < fileSize; i++)
        if (memoryAddr[i] == '\n')
            linesOffset.push_back(i + 1);
    linesOffset.pop_back();

    // 3. sort the offsets according to the required order:
    //    field 5 first, then field 3, then field 4 (all ascending)
    std::sort(linesOffset.begin(), linesOffset.end(), [memoryAddr](const size_t& offset1, const size_t& offset2) {
        std::array<uint64_t, 5> nbr1 = readFirst5Numbers(memoryAddr + offset1);
        std::array<uint64_t, 5> nbr2 = readFirst5Numbers(memoryAddr + offset2);
        if (nbr1[4] != nbr2[4])
            return nbr1[4] < nbr2[4];
        if (nbr1[2] != nbr2[2])
            return nbr1[2] < nbr2[2];
        return nbr1[3] < nbr2[3];
        });

    // 4. output sorted array
    const char* outputPath = "C:/Users/1032359/cpp-projects/Straggler Job Analyzer/output/part-00001-of-00500.csv";
    std::ofstream outputFile;
    outputFile.open(outputPath);
    for (size_t offset : linesOffset)
    {
        const char* line = memoryAddr   offset;
        size_t len = strchr(line, '\n') + 1 - line;
        outputFile.write(line, len);
    }
}
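
One detail the sketch above leaves out is cleanup: the mapped view and the two handles are never released. The OS reclaims them when the process exits, but an explicit teardown (same Win32 calls, added by me at the end of main) would be:

    UnmapViewOfFile(memoryAddr);
    CloseHandle(mappingHandle);
    CloseHandle(fileHandle);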

CodePudding user response:

I would suggest a slightly different approach:

  1. Do NOT parse the entire row; only extract the fields that are used for sorting.
  2. Note that your stated ranges require only a small number of bits, which together fit in one 64-bit value (the key layout is sketched right after this list):

field5 (machine id) <= 30,000,000 - 25 bits

field4 (task id) <= 10,000 - 14 bits

field3 (job id) <= 1,000,000 - 20 bits

  3. Save the "raw" source line in your vector, so that you can write it out as needed.
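
For reference, here is the bit layout that the packing in the code below uses, as a small sketch (make_key is just an illustrative name of mine; the answer's loop inlines the same shifts):

#include <cstdint>

// 64-bit sort key, most significant field first:
//   bits 34..58 : field5 (machine id, <= 30,000,000, 25 bits)
//   bits 14..33 : field3 (job id,     <= 1,000,000,  20 bits)
//   bits  0..13 : field4 (task id,    <= 10,000,     14 bits)
inline std::uint64_t make_key(std::uint64_t f3, std::uint64_t f4, std::uint64_t f5) {
    return (f5 << 34) | (f3 << 14) | f4;  // 59 bits used, fits comfortably in 64
}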

Here is what I got:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cstdint>   // uint64_t
#include <cstdlib>   // strtoull

struct Record {
    uint64_t key;
    std::string str;
    Record(uint64_t key, std::string&& str)
        : key(key)
        , str(std::move(str))
    {}
};

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    std::ifstream src("data.csv");
    std::vector<Record> v;
    std::string str;
    uint64_t key(0);
    while (src >> str)
    {
        size_t pos = str.find(',') + 1;
        pos = str.find(',', pos) + 1;
        char* p(nullptr);
        uint64_t f3 = strtoull(&str[pos], &p, 10);
        uint64_t f4 = strtoull(++p, &p, 10);
        uint64_t f5 = strtoull(++p, &p, 10);
        key = f5 << 34;
        key |= f3 << 14;
        key |= f4;
        v.emplace_back(key, std::move(str));
    }
    std::sort(v.begin(), v.end(), [](const Record& a, const Record& b) {
        return a.key < b.key;
        });
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;

    std::ofstream out("out.csv");
    for (const auto& r : v) {
        out << r.str << std::endl;
    }
    auto t3 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count() << std::endl;
}

Of course, you can reserve space in your vector upfront to avoid reallocations while reading; a sketch follows below.
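
A minimal sketch of that, reusing the ~160 bytes-per-line heuristic from the first answer (the 160 is an assumption about the average line length, and "data.csv" matches the file name in the code above):

#include <filesystem>  // std::filesystem::file_size (C++17)

// inside main(), before the while loop:
std::vector<Record> v;
v.reserve(std::filesystem::file_size("data.csv") / 160);  // rough average bytes per line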

I've generated a file with 18,000,000 records. My timing shows ~30 seconds for reading and sorting the file, and ~200 seconds for writing the output.
