I'm trying to count the unique words in a huge file. I want to use as much of the CPU as possible, so I split the input data into chunks and count words in separate threads. The problem is that when I split the data, a word can be cut in two, so in the end I get the wrong answer. How can I split the data from the file without splitting words? Can somebody help me?
#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <sstream>
#include <vector>
#include <algorithm>
#define BUFER_SIZE 1024
using namespace std;
std::mutex mtx;
void worker(int n, set<std::string> &mySet, std::string path)
{
    mtx.lock();
    ifstream file(path, ios::in);
    if (file.is_open())
    {
        char *memblock = new char[BUFER_SIZE];
        file.seekg(n * (BUFER_SIZE - 1), ios::beg);
        file.read(memblock, BUFER_SIZE - 1);
        memblock[file.gcount()] = '\0';
        std::string blockString(memblock);
        std::string buf;
        stringstream stream(blockString);
        while (stream >> buf) mySet.insert(buf);
        file.close();
        delete[] memblock;
    }
    else
        cout << "Unable to open file";
    mtx.unlock();
}
int main(int argc, char *argv[])
{
    set<std::string> uniqWords;
    int threadCount = 0;
    ifstream file(argv[1], ios::in);
    if (!file) {
        std::cout << "Bad path.\n";
        return 1;
    }
    file.seekg(0, ios::end);
    int fileSize = file.tellg();
    file.close();
    std::cout << "Size of the file is " << fileSize << " bytes\n";
    threadCount = fileSize / BUFER_SIZE + 1;
    std::cout << "Thread count: " << threadCount << std::endl;
    std::vector<std::thread> vec;
    for (int i = 0; i < threadCount; i++)
    {
        vec.push_back(std::thread(worker, i, std::ref(uniqWords), argv[1]));
    }
    std::for_each(vec.begin(), vec.end(), [](std::thread& th)
    {
        th.join();
    });
    std::cout << "Count: " << uniqWords.size() << std::endl;
    return 0;
}
CodePudding user response:
The problem right now is that you're reading in a fixed-size chunk and processing it. If that chunk stops mid-word, the word gets split in two, and each half is counted as a separate word, one in each buffer it was placed into.
One obvious solution is to only break between buffers at word boundaries--e.g., when you read in one chunk, look backwards from the end to find a word boundary, have the thread only process up to that boundary, and carry whatever is left over into the start of the next chunk.
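Here's a minimal sketch of that idea, assuming one reader thread does the chunking: it reads a fixed-size block, scans backwards for the last whitespace, hands the complete-word part to a worker, and carries the trailing partial word into the next block. The chunk size, the helper name countWords, and the overall shape (a global set guarded by a mutex, as in your code) are just illustrative choices, not the only way to structure it.

#include <cctype>
#include <fstream>
#include <iostream>
#include <mutex>
#include <set>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t kChunkSize = 1 << 20;   // 1 MiB per chunk; pick something much larger than 1024

std::mutex mtx;
std::set<std::string> uniqWords;

void countWords(std::string chunk)            // chunk contains only whole words
{
    std::istringstream stream(std::move(chunk));
    std::string word;
    std::lock_guard<std::mutex> lock(mtx);
    while (stream >> word) uniqWords.insert(word);
}

int main(int argc, char *argv[])
{
    std::ifstream file(argv[1], std::ios::in | std::ios::binary);
    if (!file) { std::cout << "Bad path.\n"; return 1; }

    std::vector<std::thread> threads;
    std::vector<char> buffer(kChunkSize);
    std::string leftover;                     // partial word carried into the next chunk

    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::string chunk = leftover + std::string(buffer.data(), file.gcount());

        // Walk backwards from the end of the chunk until we hit a word boundary.
        std::size_t cut = chunk.size();
        while (cut > 0 && !std::isspace(static_cast<unsigned char>(chunk[cut - 1])))
            --cut;

        if (cut == 0 && file) {               // no boundary in this chunk: keep accumulating
            leftover = std::move(chunk);
            continue;
        }
        leftover = chunk.substr(cut);         // the incomplete last word moves to the next chunk
        chunk.resize(cut);
        if (!chunk.empty())
            threads.emplace_back(countWords, std::move(chunk));
    }
    if (!leftover.empty())                    // the very last word of the file
        threads.emplace_back(countWords, std::move(leftover));

    for (auto &t : threads) t.join();
    std::cout << "Count: " << uniqWords.size() << std::endl;
    return 0;
}

Note that on a really big file this still spawns one thread per chunk; a fixed pool of workers fed from a queue would scale better, but that's orthogonal to the splitting question.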
Another solution that's a bit less obvious (but may have the potential for better performance) is to look at the last character in a chunk to see if it's a word character (e.g., a letter) or a boundary character (e.g., a space). Then when you create and process the next chunk, you tell it whether the previous buffer ended with a boundary or within a word. If it ended within a word, the counting thread knows to ignore the first partial word in the buffer.
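A sketch of how that second idea might look, under a couple of assumptions of mine: each worker opens the file itself (as in your code), peeks at the byte just before its chunk to decide whether it starts mid-word, and, since somebody still has to count the word that straddles the boundary, the worker that starts a word reads past its own chunk end to finish it. The "a word belongs to the chunk it starts in" rule, the chunk size, and the helper names are my additions, and the byte-by-byte reads are only for clarity; in practice you'd read the chunk (plus a small overlap) into a buffer first.

#include <cctype>
#include <fstream>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

constexpr std::streamoff kChunkSize = 1 << 20;   // bytes per worker (assumption)

std::mutex mtx;
std::set<std::string> uniqWords;

bool isWordChar(char c) { return !std::isspace(static_cast<unsigned char>(c)); }

void worker(int n, std::string path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) { std::cout << "Unable to open file"; return; }

    const std::streamoff begin = n * kChunkSize;

    // If the byte just before this chunk belongs to a word, the previous worker
    // finishes that word, so this worker must skip its leading partial word.
    bool skipFirst = false;
    if (begin > 0) {
        file.seekg(begin - 1);
        char prev;
        if (file.get(prev)) skipFirst = isWordChar(prev);
    }

    file.seekg(begin);
    std::set<std::string> local;                 // merge into the shared set once, at the end
    std::string word;
    std::streamoff pos = begin;
    char c;
    while (file.get(c)) {
        ++pos;
        if (isWordChar(c)) {
            word += c;                           // keep reading, even past the chunk end
        } else {
            if (!word.empty() && !skipFirst) local.insert(word);
            word.clear();
            skipFirst = false;
            if (pos >= begin + kChunkSize) break; // chunk consumed, stop at this boundary
        }
    }
    if (!word.empty() && !skipFirst) local.insert(word);  // last word of the file

    std::lock_guard<std::mutex> lock(mtx);
    uniqWords.insert(local.begin(), local.end());
}

int main(int argc, char *argv[])
{
    std::ifstream file(argv[1], std::ios::in | std::ios::binary | std::ios::ate);
    if (!file) { std::cout << "Bad path.\n"; return 1; }
    const std::streamoff fileSize = file.tellg();
    file.close();

    const int threadCount = static_cast<int>(fileSize / kChunkSize) + 1;
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i)
        threads.emplace_back(worker, i, std::string(argv[1]));
    for (auto &t : threads) t.join();

    std::cout << "Count: " << uniqWords.size() << std::endl;
    return 0;
}

Because every word is counted by exactly one worker, no word is split or double-counted, and the workers never need to coordinate while reading; they only touch the shared set once each, when merging their local results.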