I am trying to optimize a text search where I am searching for multiple words. I want to know the frequency of the search words, per line.
I have tried to make it as fast as I can, as I want to run the search many times, with multiple keywords, on the same data.
I still think, though, that there should be a more efficient way to solve this. Does anybody have some good suggestions?
I have put up a simple demo to show the POC on gitlab: https://gitlab.com/dkruithof/textfind
My current search time is 410 ms for 6 keywords in a dataset of 408 MB.
Also, the source of the demo is this:
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
using namespace std;
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    auto it = wordLookup.find(word);
    unsigned int id;
    if (it == wordLookup.end())
    {
        id = wordLookup.size(); //assign consecutive numbers using size()
        wordLookup[word] = id;
    }
    else
    {
        id = it->second;
    }
    return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string& line)
{
    static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
    char str[line.size() + 1];
    strncpy(str, line.c_str(), line.size());
    str[line.size()] = '\0'; //strncpy does not null-terminate here, so do it explicitly for strtok
    // Getting the first token
    char *token = strtok(str, newsDelimiters);
    while (token != NULL)
    {
        //finding a word:
        unsigned int id = addWord(wordLookup, token);
        wordList.push_back(id);
        // Getting the next token
        // If there are no tokens left, NULL is returned
        token = strtok(NULL, newsDelimiters);
    }
}
int main()
{
    std::vector<std::vector<unsigned int>> textAsNumbers;
    std::map<std::string, unsigned int> wordLookup;
    std::vector<std::string> searchWords = {"this", "blog", "political", "debate", "climate", "iphone"};
    unsigned int searchLength = searchWords.size();
    unsigned int searchWordIds[searchLength];

    //convert searchWords
    unsigned int i = 0;
    for(const std::string& word : searchWords)
    {
        searchWordIds[i] = addWord(wordLookup, word);
        ++i;
    }

    //#### This part is not time critical ####
    //reading file and convert words to numbers
    fstream newsFile;
    newsFile.open("news.txt", ios::in);
    if (newsFile.is_open())
    {
        string line;
        while(getline(newsFile, line))
        {
            textAsNumbers.push_back(std::vector<unsigned int>());
            std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
            tokenizeWords(wordLookup, wordList, line);
        }
        newsFile.close();
    }

    //#### This part should be fast ####
    auto start = std::chrono::system_clock::now();
    std::vector<unsigned int> counts; //end result
    counts.reserve(textAsNumbers.size());
    for(std::vector<unsigned int>& line : textAsNumbers)
    {
        unsigned int count = 0;
        for(unsigned int word : line)
        {
            for(unsigned int s = 0; s < searchLength; ++s)
            {
                unsigned int searchWord = searchWordIds[s];
                if(word == searchWord)
                {
                    ++count;
                }
            }
        }
        counts.push_back(count);
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    cout << elapsed.count() << "ms" << endl;

    //#### Print for checking result, time insensitive :)
    int n = 0;
    for(unsigned int count : counts)
    {
        cout << "Count[" << n << "]: " << count << endl;
        ++n;
        if(n > 100)
        {
            break;
        }
    }
}
CodePudding user response:
I'd try building a radix tree (https://en.wikipedia.org/wiki/Radix_tree) that contains all your search words. When processing each line of text you then only need to maintain one pointer into the radix tree per character position, advance all of them with every additionally consumed character, and drop a pointer once its character sequence can no longer reach a valid word. Whenever an advanced pointer reaches the end of a word, you increment your counter.
This shouldn't require any tokenization.
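For illustration, a minimal sketch of that idea (not the answer author's code): it uses a plain, uncompressed trie rather than a true radix tree, assumes the search words consist only of ASCII letters, and counts every occurrence of a search word inside a line, including inside longer words; an extra delimiter check would be needed to reproduce the whole-word behaviour of the question's tokenizer. The names TrieNode, Trie and countMatches are mine.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct TrieNode
{
    int child[26];        // index into the node pool, -1 if the edge is absent
    bool isWord = false;  // true if a search word ends at this node
    TrieNode() { std::fill(child, child + 26, -1); }
};

struct Trie
{
    std::vector<TrieNode> nodes;
    Trie() : nodes(1) {} // node 0 is the root

    void insert(const std::string& word) // word must contain only letters
    {
        int cur = 0;
        for (char c : word)
        {
            int idx = std::tolower((unsigned char)c) - 'a';
            if (nodes[cur].child[idx] == -1)
            {
                nodes[cur].child[idx] = (int)nodes.size();
                nodes.push_back(TrieNode());
            }
            cur = nodes[cur].child[idx];
        }
        nodes[cur].isWord = true;
    }
};

unsigned int countMatches(const Trie& trie, const std::string& line)
{
    unsigned int count = 0;
    std::vector<int> cursors; // one live cursor per starting position that still matches a prefix
    for (char c : line)
    {
        if (!std::isalpha((unsigned char)c))
        {
            cursors.clear(); // search words contain only letters, so all partial matches die here
            continue;
        }
        int idx = std::tolower((unsigned char)c) - 'a';
        cursors.push_back(0); // a new match may start at this character
        std::size_t keep = 0;
        for (int node : cursors)
        {
            int next = trie.nodes[node].child[idx];
            if (next == -1)
                continue;       // this partial match can no longer reach a valid word
            if (trie.nodes[next].isWord)
                ++count;        // a complete search word ends here
            cursors[keep++] = next;
        }
        cursors.resize(keep);
    }
    return count;
}

Usage would be to insert each search word once and then call countMatches on every line instead of the tokenize-and-compare loop.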
CodePudding user response:
In practice I would perhaps serialize the whole text into a std::unordered_map<std::string, int>, where the string is a word and the int is the count of that word in the text. Building that is roughly O(X), where X is the number of words in the text, assuming individual words are short enough that hashing them does not dominate. You said it is not time critical ... but just for the record.
After that, looking up a word is O(1), again assuming that a "word" is a relatively short string, and the count of that word is immediately available. If you have a list of words to search for, the whole search is O(N), where N is the length of the list.
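A rough sketch of that approach (the names buildWordCounts and countFor are mine, and it assumes the text has already been split into lower-cased words, e.g. by the question's tokenizer):

#include <string>
#include <unordered_map>
#include <vector>

// Build the word -> count table once: roughly O(X) in the total number of words.
std::unordered_map<std::string, unsigned int> buildWordCounts(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, unsigned int> counts;
    counts.reserve(words.size());
    for (const std::string& w : words)
        ++counts[w]; // operator[] inserts 0 for a new word, then we increment it
    return counts;
}

// Each lookup is O(1) on average, so a list of N search words costs O(N).
unsigned int countFor(const std::unordered_map<std::string, unsigned int>& counts, const std::string& word)
{
    auto it = counts.find(word);
    return it == counts.end() ? 0 : it->second;
}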
CodePudding user response:
You do not need to iterate over all the searchWordIds items. Assuming this array does not contain any duplicates, you can use a hash table so that the algorithm runs in O(n²) time rather than O(n³) time (thanks to the O(1) search in searchWordIds). More specifically, a std::unordered_set<int> can be used to check whether word is in searchWordIds in constant time. You need to convert searchWordIds to a std::unordered_set<int> first. If the array does have duplicates, you can use a std::unordered_map<int, int> to store the number of duplicates associated with a given word; in that last case, the two nested loops reduce to count += searchWordIds[word].
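A minimal sketch of the no-duplicates case, as a drop-in replacement for the timed loop in the question (countPerLine and searchWordSet are my names; the set is built once from the question's searchWordIds, and unsigned int is used to match the question's id type):

#include <unordered_set>
#include <vector>

// searchWordSet = std::unordered_set<unsigned int>(searchWordIds, searchWordIds + searchLength);
std::vector<unsigned int> countPerLine(const std::vector<std::vector<unsigned int>>& textAsNumbers,
                                       const std::unordered_set<unsigned int>& searchWordSet)
{
    std::vector<unsigned int> counts;
    counts.reserve(textAsNumbers.size());
    for (const std::vector<unsigned int>& line : textAsNumbers)
    {
        unsigned int count = 0;
        for (unsigned int word : line)
            if (searchWordSet.count(word) != 0) // O(1) average, instead of scanning searchWordIds
                ++count;
        counts.push_back(count);
    }
    return counts;
}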
If this is not enough, you can use a Bloom filter to speed up the lookup in searchWordIds. This probabilistic data structure can very quickly tell you that word is definitely not in searchWordIds, or that it probably is (with good accuracy, assuming the Bloom filter is sufficiently large). This should be at least twice as fast, possibly even more (unordered_set and unordered_map are generally not very efficient, partly due to the use of linked-list-based buckets and slow hash management).
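For illustration only, a very small Bloom-filter sketch over the integer word ids; the structure, the two multiplicative hashes and the sizing are arbitrary choices of mine, not something prescribed by the answer, and a positive result still has to be confirmed against the real set:

#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter
{
    std::vector<std::uint64_t> bits;

    explicit BloomFilter(std::size_t nbits) : bits((nbits + 63) / 64) {}

    static std::size_t h1(unsigned int x) { return x * 2654435761u; }
    static std::size_t h2(unsigned int x) { return x * 40503u + 0x9e3779b9u; }

    void add(unsigned int x)
    {
        std::size_t n = bits.size() * 64;
        bits[(h1(x) % n) / 64] |= std::uint64_t(1) << ((h1(x) % n) % 64);
        bits[(h2(x) % n) / 64] |= std::uint64_t(1) << ((h2(x) % n) % 64);
    }

    // false -> x is definitely not a search word; true -> x probably is (check the real set to be sure)
    bool maybeContains(unsigned int x) const
    {
        std::size_t n = bits.size() * 64;
        return ((bits[(h1(x) % n) / 64] >> ((h1(x) % n) % 64)) & 1)
            && ((bits[(h2(x) % n) / 64] >> ((h2(x) % n) % 64)) & 1);
    }
};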
If this is still not enough, you can parallelize the outermost loop. The idea is to compute a local count value for each section of the textAsNumbers array and then perform a final reduction. This assumes the sizes of the sub-arrays are relatively uniform (it will not scale well if one line is much bigger than all the others). You can flatten the vector<vector<int>> so as to better load-balance the work, and likely even improve sequential performance (due to fewer indirections and likely fewer cache misses).
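A sketch of the parallelized outer loop using OpenMP (compile with -fopenmp; the answer does not name a particular threading mechanism, and countPerLineParallel is my name). Since the question wants one count per line, each iteration writes its own slot of counts and no explicit reduction is needed for this particular output; schedule(dynamic) gives some load balancing when line lengths differ:

#include <unordered_set>
#include <vector>

std::vector<unsigned int> countPerLineParallel(const std::vector<std::vector<unsigned int>>& textAsNumbers,
                                               const std::unordered_set<unsigned int>& searchWordSet)
{
    std::vector<unsigned int> counts(textAsNumbers.size());
    #pragma omp parallel for schedule(dynamic, 1024)
    for (long i = 0; i < (long)textAsNumbers.size(); ++i)
    {
        unsigned int count = 0;
        for (unsigned int word : textAsNumbers[i])
            count += (unsigned int)searchWordSet.count(word); // 0 or 1 per word
        counts[i] = count; // each line has its own output slot, so threads do not contend
    }
    return counts;
}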