For my project, I have to search through a data set to find a matching string. Previously this was implemented by comparing every single item against the target string, but now my team wants it to run faster. I cannot use a hash map because we are searching through multiple datasets for the string, so I need an alternative. Please help, thank you!
//Looking for a specific string (time) in each data set
foreach (var dataset in datasets)
{
    foreach (DataRow row in dataset.Tables[0].Rows)
    {
        if (row.ItemArray[0].ToString() == time.ToString())
        {
            //Enter this block
        }
    }
}
CodePudding user response:
Use bitmaps. For example, this one: https://github.com/Auralytical/CRoaring.Net
- Split your files into tokens and normalize them (lowercase, trim, etc.)
- Assign an index to each unique token (a plain list works, for example; any new token is just appended at the end)
- If a file contains a token, set the bit at that token's index.
- You end up with a compact (sparse) bitmap index for each file.
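The steps above can be sketched roughly like this. It is a minimal illustration only: a plain `HashSet<int>` stands in for a roaring bitmap (the CRoaring.Net API is not shown here), and the class and method names are my own, not from any library.

```csharp
using System;
using System.Collections.Generic;

// Minimal indexing sketch. A HashSet<int> stands in for a roaring bitmap;
// with CRoaring.Net you would set real bits instead.
class TokenIndex
{
    // token -> its assigned index; any new token is appended at the end
    public Dictionary<string, int> TokenIds = new Dictionary<string, int>();

    public int GetOrAddId(string token)
    {
        token = token.Trim().ToLowerInvariant();   // normalize
        if (!TokenIds.TryGetValue(token, out int id))
        {
            id = TokenIds.Count;                   // next free index
            TokenIds[token] = id;
        }
        return id;
    }

    // One "bitmap" per file: the set of token indexes the file contains.
    public HashSet<int> IndexFile(IEnumerable<string> tokens)
    {
        var bits = new HashSet<int>();
        foreach (var t in tokens)
            bits.Add(GetOrAddId(t));
        return bits;
    }
}
```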
All you have to do now is build the same bitmap for your query, then scan through all the file bitmaps to find which files match. You can also build a binary search tree over them to make lookup logarithmic.
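Continuing the same sketch, a file matches when its bitmap contains every bit of the query bitmap. Again, `HashSet<int>` is a stand-in for a roaring bitmap (where this would be an AND/containment test), and the names are illustrative assumptions.

```csharp
using System.Collections.Generic;

// Query sketch: build the query bitmap the same way as the file bitmaps,
// then keep every file whose bitmap is a superset of the query's.
static class Search
{
    public static IEnumerable<string> FindMatches(
        Dictionary<string, HashSet<int>> fileBitmaps,  // file name -> bitmap
        HashSet<int> queryBitmap)
    {
        foreach (var kv in fileBitmaps)
            if (queryBitmap.IsSubsetOf(kv.Value))      // all query tokens present
                yield return kv.Key;
    }
}
```

This linear scan is the simple version; the binary-search-tree idea mentioned above would replace the `foreach` over all files.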
pros
The final complexity is O(N) for indexing and O(log(N)) for finding any query you desire, where N is the count of documents.
cons
If your tokens are effectively random (hashes, MD5 sums, telephone numbers, datetimes, etc.), this approach will be very slow, because the token domain is practically infinite. You should ignore those tokens in the bitmap index and put them in some kind of offloading tree instead (or just a key-value database).
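One way the offloading idea might look: route tokens from a practically infinite domain into a plain key-value map instead of the bitmap index. The detection heuristics below are purely illustrative assumptions, not a fixed rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of "offloading" high-cardinality tokens (hashes, datetimes,
// phone numbers) out of the bitmap index into a key-value store.
static class TokenRouter
{
    // Illustrative heuristics for "random-looking" tokens.
    public static bool LooksRandom(string token) =>
        (token.Length >= 16 && token.All(Uri.IsHexDigit))   // e.g. an MD5 hash
        || DateTime.TryParse(token, out _)                  // e.g. a datetime
        || (token.Length >= 7 && token.All(char.IsDigit));  // e.g. a phone number

    // token -> list of files that contain it (the key-value "offload" store)
    public static void Offload(Dictionary<string, List<string>> store,
                               string token, string file)
    {
        if (!store.TryGetValue(token, out var files))
            store[token] = files = new List<string>();
        files.Add(file);
    }
}
```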
PS
This is how most full-text search engines do it, so you can just use ElasticSearch or a Lucene shard directly (which I recommend, because it is a simple, very flexible, production-ready engine, meaning a gazillion people already use it for the most perverted business logic, like building their own search engines or porn story catalogs).