How to count words in huge file by spliting data?-CodePudding

I try to count word in huge file. I want to use max of CPU resources and i try to split input data and count words in threads. But i have a problem, when i split data it can split the words and in the end i have wrong answer. How can i split data from file to avoid spliting words? Can somebody help me?

#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <sstream>
#include <vector>
#include <algorithm>
#define BUFER_SIZE  1024

using namespace std;
std::mutex mtx;

void worker(int n, set<std::string> &mySet, std::string path)
{
    mtx.lock();
    ifstream file (path, ios::in);
    if (file.is_open())
    {
        char *memblock = new char [BUFER_SIZE];
        file.seekg (n * (BUFER_SIZE - 1), ios::beg);
        file.read(memblock, BUFER_SIZE - 1);
        std::string blockString(memblock);
        std::string buf;
        stringstream stream(blockString);
        while(stream >> buf) mySet.insert(buf);
        memblock[BUFER_SIZE] = '\0';
        file.close();
        delete[] memblock;
    }
    else 
        cout << "Unable to open file";
    mtx.unlock();
}

int main(int argc, char *argv[])
{
    set<std::string> uniqWords;
    int threadCount = 0;
    ifstream file(argv[1], ios::in);
    if(!file){
        std::cout << "Bad path.\n";
        return 1;
    }
    file.seekg(0, ios::end);
    int fileSize = file.tellg();
    file.close();
    std::cout << "Size of the file is" << " " << fileSize << " " << "bytes\n";
    threadCount = fileSize/BUFER_SIZE   1;
    std::cout << "Thread count: " << threadCount << std::endl;
    std::vector<std::thread> vec;
    for(int i=0; i < threadCount; i  )
    {
        vec.push_back(std::thread(worker, i, std::ref(uniqWords), argv[1]));
    }
    std::for_each(vec.begin(), vec.end(), [](std::thread& th)
    {
        th.join();
    });
    std::cout << "Count: " << uniqWords.size() << std::endl;
    return 0;
}

CodePudding user response：

The problem right now is that you're reading in a fixed-size chunk, and processing it. If that stops mid-word, you're doing to count the word twice, once in each buffer it was placed into.

One obvious solution is to only break between buffers at word boundaries--e.g., when you read in one chunk, look backwards from the end to find a word boundary, then have the thread only process up to that boundary.

Another solution that's a bit less obvious (but may have the potential for better performance) is to look at the last character in a chunk to see if it's a word character (e.g., a letter) or a boundary character (e.g., a space). Then when you create and process the next chunk, you tell it whether the previous buffer ended with a boundary or within a word. If it ended within a word, the counting thread knows to ignore the first partial word in the buffer.