Dynamically segment large text files and pass in multithreaded function?

Time:03-29

I am reading a large file (over 5 GB). I have a function processFunc that parses this file and gives me count and frequency.

I don't want to repeat myself as shown below. Is there a logical way to set the thread count dynamically and read the data in segments? What would a function like that look like?

I have tried looking into thread pools, but I don't want to use external libs or binaries for this, and I couldn't find any C++ docs for a built-in thread pool.

    uint threads = 8;
    uint seg = file.length() / threads;

    string dataSeg1 = file.substr(0, seg);
    string dataSeg2 = file.substr(seg, seg);
    string dataSeg3 = file.substr(seg * 2, seg);
    string dataSeg4 = file.substr(seg * 3, seg);
    string dataSeg5 = file.substr(seg * 4, seg);
    string dataSeg6 = file.substr(seg * 5, seg);
    string dataSeg7 = file.substr(seg * 6, seg);
    string dataSeg8 = file.substr(seg * 7, -1);

    std::thread first(processFunc, dataSeg1);
    std::thread second(processFunc, dataSeg2);
    std::thread third(processFunc, dataSeg3);
    std::thread fourth(processFunc, dataSeg4);
    std::thread fifth(processFunc, dataSeg5);
    std::thread sixth(processFunc, dataSeg6);
    std::thread seventh(processFunc, dataSeg7);
    std::thread eighth(processFunc, dataSeg8);

    first.join();
    second.join();
    third.join();
    fourth.join();
    fifth.join();
    sixth.join();
    seventh.join();
    eighth.join();

Any suggestions would be helpful.

CodePudding user response:

In your case, combine a std::vector with a loop:

std::vector<std::thread> threads;

size_t numThreads = 7; // not 8, see below!
size_t seg = file.length() / (numThreads + 1); // 8 segments in total

for(size_t i = 0; i < numThreads; ++i)
{
    threads.emplace_back(processFunc, file.substr(i*seg, seg));
}
// now why should *this* thread just sleep waiting?
// you can let it do one of the tasks instead:
processFunc(file.substr(seg*numThreads, std::string::npos));
// (side note: avoids special handling for the differing second
//  argument in the loop as well...)

// now when completed itself, it waits for the others:
for(auto& t : threads)
{
    t.join();
}

Et voilà...
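
To pick the thread count at run time instead of hard-coding it, the loop above can be wrapped in a function. The following is only a sketch: runSegmented, defaultThreadCount and countBytesInParallel are hypothetical names, and the callback stands in for processFunc, whose real signature the question does not show. It assumes the standard std::thread::hardware_concurrency(), which may return 0 when the value cannot be determined.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: runs `func` on `numThreads` roughly equal segments
// of `file`; the last segment is handled by the calling thread.
void runSegmented(const std::string& file, std::size_t numThreads,
                  const std::function<void(std::string)>& func)
{
    assert(numThreads > 0);
    const std::size_t seg = file.length() / numThreads;

    std::vector<std::thread> workers;
    workers.reserve(numThreads - 1);
    for (std::size_t i = 0; i + 1 < numThreads; ++i)
    {
        workers.emplace_back(func, file.substr(i * seg, seg));
    }
    // The last segment also takes the remainder of the division.
    func(file.substr(seg * (numThreads - 1)));

    for (auto& t : workers)
    {
        t.join();
    }
}

// hardware_concurrency() is only a hint and may be 0, so fall back to 1.
std::size_t defaultThreadCount()
{
    return std::max(1u, std::thread::hardware_concurrency());
}

// Usage sketch: sums the sizes of all segments (a stand-in for real work).
std::size_t countBytesInParallel(const std::string& file)
{
    std::atomic<std::size_t> total{0};
    runSegmented(file, defaultThreadCount(),
                 [&total](std::string segment) { total += segment.size(); });
    return total;
}
```

Note that this sketch does not handle exceptions: if func throws on the calling thread, the workers are never joined and the program terminates.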

Alternatively you can just detach the threads in the loop, then you don't need to join them and thus don't need the vector, however you need to make sure the main thread runs as long as the last of the child threads, so you still need some means of synchronisation to get a complete result. But in situations where you don't rely on any completion of child threads, i.e. it is fine for them to get aborted while running, then detaching can be fine.
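
One possible way to get that synchronisation with detached threads is a shared counter guarded by a mutex and a condition variable. This is a sketch under assumptions, not a recommendation over join(): Completion and processDetached are hypothetical names, each thread gets its own copy of its segment, and summing segment sizes stands in for the real work.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical shared state; held via shared_ptr so it outlives
// whichever thread finishes last.
struct Completion
{
    std::mutex mutex;
    std::condition_variable done;
    std::size_t pending = 0;
    std::size_t bytesSeen = 0;
};

std::size_t processDetached(const std::string& file, std::size_t numThreads)
{
    assert(numThreads > 0);
    const std::size_t seg = file.length() / numThreads;
    auto state = std::make_shared<Completion>();
    state->pending = numThreads;

    for (std::size_t i = 0; i < numThreads; ++i)
    {
        const bool last = (i + 1 == numThreads);
        std::string part = last ? file.substr(i * seg) : file.substr(i * seg, seg);
        std::thread([state, part = std::move(part)]
        {
            const std::size_t n = part.size(); // stand-in for the real work
            std::lock_guard<std::mutex> lock(state->mutex);
            state->bytesSeen += n;
            if (--state->pending == 0)
            {
                state->done.notify_one();
            }
        }).detach();
    }

    // Block until every detached thread has reported back.
    std::unique_lock<std::mutex> lock(state->mutex);
    state->done.wait(lock, [&state] { return state->pending == 0; });
    return state->bytesSeen;
}
```

This ends up being more code than the vector-and-join version, which illustrates why detaching rarely pays off when the result is needed.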

Side note: If you don't need copies of the original string (which the sub-strings are, albeit partial ones), you could instead pass the function begin and end iterators for the ranges to operate on. That would also allow modifying the original file content in place, if desired (but not resizing it!):

auto begin = file.begin(); // cbegin, if you want/need const_iterators
for(size_t i = 0; i < numThreads; ++i)
{
    auto end = begin + seg;
    threads.emplace_back(processFunc, begin, end);
    begin = end;
}
processFunc(begin, file.end()); // or cend, see above
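
As a fuller sketch of the iterator variant, the following counts byte frequencies without copying any segment. The names Histogram, countRange and byteFrequencies are hypothetical, and countRange is only an assumed stand-in for processFunc. Each thread writes to its own histogram, so no locking is needed; the partial results are merged after joining.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Histogram = std::array<std::size_t, 256>;

// Assumed stand-in for processFunc: counts each byte value in [begin, end).
void countRange(std::string::const_iterator begin,
                std::string::const_iterator end, Histogram& out)
{
    for (auto it = begin; it != end; ++it)
    {
        ++out[static_cast<unsigned char>(*it)];
    }
}

Histogram byteFrequencies(const std::string& file, std::size_t numThreads)
{
    assert(numThreads > 0);
    const auto seg = static_cast<std::ptrdiff_t>(file.length() / numThreads);

    std::vector<Histogram> partial(numThreads, Histogram{});
    std::vector<std::thread> workers;
    workers.reserve(numThreads - 1);

    auto begin = file.cbegin();
    for (std::size_t i = 0; i + 1 < numThreads; ++i)
    {
        auto end = begin + seg;
        workers.emplace_back(countRange, begin, end, std::ref(partial[i]));
        begin = end;
    }
    // The calling thread handles the last range, including the remainder.
    countRange(begin, file.cend(), partial[numThreads - 1]);

    for (auto& t : workers)
    {
        t.join();
    }

    // Merge the per-thread histograms into one result.
    Histogram total{};
    for (const auto& h : partial)
    {
        for (std::size_t b = 0; b < total.size(); ++b)
        {
            total[b] += h[b];
        }
    }
    return total;
}
```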

CodePudding user response:

If you have C++17 then you probably want to use std::string_view so you do not have full copies of the individual segments. Instead, you just reference the appropriate sections:

#include <future>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

const std::string file{"12345678"};

std::string_view processFunc(std::string_view segment)
{
    return segment;
}

int main()
{
    const int threads = 8;
    const size_t seg = file.length() / threads;

    std::vector<std::future<std::string_view>> fragments{threads};

    for(size_t index = 0; index < fragments.size(); ++index)
    {
        fragments[index] = std::async(
            std::launch::async,
            // file is a global, so it is used directly rather than captured
            [index, seg]()
            {
                const size_t offset{seg * index};
                std::string_view data{file.data() + offset, seg};

                return processFunc(data);
            });
    }

    for(auto& fragment : fragments)
    {
        std::cout << fragment.get() << '\n';
    }

    return 0;
}
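
One caveat with fixed-size segments is that a cut can land in the middle of a line or word, and an uneven division leaves a remainder. The sketch below is one possible way to address both, assuming the file is line-oriented text: it moves each cut forward to just past the next newline. splitOnLines and countLines are hypothetical names, and counting newlines stands in for the real processing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits `text` into at most `parts` views. Every view
// except possibly the last ends right after a '\n'.
std::vector<std::string_view> splitOnLines(std::string_view text, std::size_t parts)
{
    assert(parts > 0);
    std::vector<std::string_view> segments;
    const std::size_t seg = text.size() / parts;

    std::size_t start = 0;
    while (start < text.size())
    {
        std::size_t end = text.size(); // default: take everything that is left
        if (segments.size() + 1 < parts)
        {
            // Tentative cut at start + seg, moved to just past the next newline.
            const std::size_t newline = text.find('\n', start + seg);
            if (newline != std::string_view::npos)
            {
                end = newline + 1;
            }
        }
        segments.push_back(text.substr(start, end - start));
        start = end;
    }
    return segments;
}

// Usage sketch: counts newlines in parallel, one task per segment.
std::size_t countLines(const std::string& file, std::size_t parts)
{
    std::vector<std::future<std::size_t>> results;
    for (std::string_view segment : splitOnLines(file, parts))
    {
        results.push_back(std::async(std::launch::async, [segment]
        {
            return static_cast<std::size_t>(
                std::count(segment.begin(), segment.end(), '\n'));
        }));
    }

    std::size_t total = 0;
    for (auto& r : results)
    {
        total += r.get();
    }
    return total;
}
```

The views only reference the original string, so it must stay alive and unmodified until all tasks have finished.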