How do I find the most common words in a very big file (over 1 GB) using 1 KB or less of memory?

Time:05-15

I have a problem here that has to be solved using only pure C.

I have a very big text file with tens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only C's standard library may be used, and less than 1 KB of memory.

It is guaranteed that any 10 words from the file are short enough to fit within that memory limit, and that there will be enough memory left over for a few other variables such as counters, etc.

The only solution I have come up with is to use another text file as additional memory and a buffer. But that seems like a bad and slow way to deal with the problem.

Is there any better solution?

CodePudding user response:

You can first sort the file (this is possible with limited memory, but it will require disk I/O of course; see How do I sort very large files as a starter).
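
For illustration, here is a minimal C sketch of the core building block of such an external sort: merging two already-sorted run files into one. The file names run1.txt, run2.txt and merged.txt, as well as the 63-character word limit, are assumptions made for the example.

```c
/* Minimal sketch of the core step of an external merge sort: merging two
 * already-sorted run files into one. The names run1.txt, run2.txt and
 * merged.txt, and the 63-character word limit, are assumptions made for
 * the example. Lines keep their trailing '\n', which is assumed to be
 * present on every line. */
#include <stdio.h>
#include <string.h>

#define MAXW 64

int main(void)
{
    FILE *a   = fopen("run1.txt", "r");
    FILE *b   = fopen("run2.txt", "r");
    FILE *out = fopen("merged.txt", "w");
    if (!a || !b || !out) { perror("fopen"); return 1; }

    char wa[MAXW], wb[MAXW];
    int ha = (fgets(wa, sizeof wa, a) != NULL);   /* have a word from a? */
    int hb = (fgets(wb, sizeof wb, b) != NULL);   /* have a word from b? */

    /* repeatedly emit whichever of the two current words sorts first */
    while (ha && hb) {
        if (strcmp(wa, wb) <= 0) {
            fputs(wa, out);
            ha = (fgets(wa, sizeof wa, a) != NULL);
        } else {
            fputs(wb, out);
            hb = (fgets(wb, sizeof wb, b) != NULL);
        }
    }
    while (ha) { fputs(wa, out); ha = (fgets(wa, sizeof wa, a) != NULL); }
    while (hb) { fputs(wb, out); hb = (fgets(wb, sizeof wb, b) != NULL); }

    fclose(a); fclose(b); fclose(out);
    return 0;
}
```

A full external merge sort would first split the big file into many small sorted runs (each small enough to sort within the memory limit) and then apply this merge step repeatedly until a single sorted file remains.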

Then you can read the sorted file line by line and compute the frequency of each word as you go, since equal words are now adjacent. Keep an internal array of 10 entries; once it is full, whenever the frequency of the word just counted is higher than the lowest frequency stored in the array, insert it and evict the least frequent entry. This way you keep only the 10 most frequent words in memory during this stage.
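
A minimal sketch of this counting pass, assuming the sorted file is named sorted.txt and that no word exceeds 63 characters (both assumptions made for the example, not stated in the question), could look like this; the arrays below total well under 1 KB.

```c
/* Minimal sketch of the second pass over the sorted file. File name and
 * word-length limit are assumptions made for the example. */
#include <stdio.h>
#include <string.h>

#define TOP  10
#define MAXW 64

int main(void)
{
    FILE *f = fopen("sorted.txt", "r");
    if (!f) { perror("fopen"); return 1; }

    char top[TOP][MAXW] = {{0}};   /* current top-10 words              */
    long cnt[TOP] = {0};           /* and their frequencies             */
    char cur[MAXW] = "", line[MAXW];
    long run = 0;                  /* length of the current run of cur  */

    for (;;) {
        int more = (fgets(line, sizeof line, f) != NULL);
        if (more) line[strcspn(line, "\r\n")] = '\0';

        if (!more || strcmp(line, cur) != 0) {
            /* the run of cur has ended: try to place it in the top 10 */
            if (run > 0) {
                int min = 0;
                for (int i = 1; i < TOP; i++)
                    if (cnt[i] < cnt[min]) min = i;
                if (run > cnt[min]) {
                    cnt[min] = run;
                    strcpy(top[min], cur);
                }
            }
            if (!more) break;
            strcpy(cur, line);     /* start counting a new word */
            run = 0;
        }
        run++;
    }
    fclose(f);

    for (int i = 0; i < TOP; i++)
        if (cnt[i] > 0)
            printf("%s %ld\n", top[i], cnt[i]);
    return 0;
}
```

Since the input is sorted, each distinct word forms a single consecutive run, so one running counter is enough; ties at the 10th place are broken arbitrarily here.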

As @John Bollinger mentioned, if your requirement is to print all top-10 words (for example, when every word in the file has the same frequency, so that they are all "top"), then this approach will not work. In that case you need to compute the frequency of each word, store the word/frequency pairs in a file, sort that file by frequency, and then print the top 10 including every word whose frequency equals that of the 10th one.

CodePudding user response:

If you can create a new file, however big, you can build a simple disk-based tree database holding each word and its frequency so far. Each update costs O(log n), with n growing from 1 to N as the words are inserted, plus a final scan of the whole N-sized tree, which adds up to O(N log N).
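
A minimal sketch of such an on-disk tree, assuming fixed-size records in a scratch file tree.dat and a 63-character word limit (both assumptions made for the example), might look like the following. Note that a plain unbalanced binary search tree like this one only achieves O(log n) per insert on average; a balanced structure such as a B-tree would be needed to guarantee it.

```c
/* Minimal sketch of a disk-based binary search tree with fixed-size
 * records. The scratch-file name tree.dat and the 63-character word limit
 * are assumptions made for the example. Each record stores a word, its
 * count so far, and the file offsets of its left/right children
 * (-1 meaning "no child"). */
#include <stdio.h>
#include <string.h>

#define MAXW 64

struct node {
    char word[MAXW];
    long count;
    long left, right;   /* byte offsets of the children in the file */
};

/* Insert `word` into the tree file, or increment its count if present. */
static void upsert(FILE *t, const char *word)
{
    struct node n;
    long pos = 0;

    fseek(t, 0, SEEK_END);
    if (ftell(t) == 0) {                /* empty tree: write the root */
        memset(&n, 0, sizeof n);
        strcpy(n.word, word);
        n.count = 1; n.left = n.right = -1;
        fwrite(&n, sizeof n, 1, t);
        return;
    }

    for (;;) {
        fseek(t, pos, SEEK_SET);
        fread(&n, sizeof n, 1, t);
        int c = strcmp(word, n.word);
        if (c == 0) {                   /* found: bump the counter */
            n.count++;
            fseek(t, pos, SEEK_SET);
            fwrite(&n, sizeof n, 1, t);
            return;
        }
        long *link = (c < 0) ? &n.left : &n.right;
        if (*link != -1) {              /* descend to the child */
            pos = *link;
            continue;
        }
        /* no child on that side: append a new node and link it */
        fseek(t, 0, SEEK_END);
        long newpos = ftell(t);
        struct node fresh;
        memset(&fresh, 0, sizeof fresh);
        strcpy(fresh.word, word);
        fresh.count = 1; fresh.left = fresh.right = -1;
        fwrite(&fresh, sizeof fresh, 1, t);
        *link = newpos;
        fseek(t, pos, SEEK_SET);
        fwrite(&n, sizeof n, 1, t);     /* rewrite the parent with the link */
        return;
    }
}
```

To use it, open tree.dat with fopen("tree.dat", "w+b"), call upsert() for each newline-stripped word read from the big file, and finally read the record file sequentially, one struct node at a time, keeping the ten highest counts with the same top-10 logic as in the earlier sketch.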

If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N²); I think it is actually more like O((N/k)²), with k the average number of words you can keep in memory, for the simplest bubble sort. At that point you can rescan the file one final time, and after each run of a word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.
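
A minimal sketch of such an in-place sort, assuming the file holds fixed-width records (one word per 64-byte record, padded with '\0'), might look like this; the fixed-width layout is only an assumption made to keep the on-disk swap simple, and adapting it to variable-length lines is the fiddly part in practice. The final rescan can then reuse the top-10 logic from the first sketch above.

```c
/* Minimal sketch of an in-place bubble sort over a file of fixed-width
 * records (one word per 64-byte record, padded with '\0'). The file name
 * words.dat and the record layout are assumptions made for the example;
 * a real one-word-per-line file would either need converting to this
 * layout first or the swap adapted to variable-length lines. */
#include <stdio.h>
#include <string.h>

#define MAXW 64   /* assumed fixed record / maximum word size */

int main(void)
{
    FILE *f = fopen("words.dat", "r+b");
    if (!f) { perror("fopen"); return 1; }

    fseek(f, 0, SEEK_END);
    long n = ftell(f) / MAXW;          /* number of records in the file */

    char a[MAXW], b[MAXW];
    int swapped = 1;
    while (swapped) {                  /* classic bubble-sort passes */
        swapped = 0;
        for (long i = 0; i + 1 < n; i++) {
            fseek(f, i * MAXW, SEEK_SET);
            fread(a, 1, MAXW, f);      /* record i   */
            fread(b, 1, MAXW, f);      /* record i+1 */
            if (memcmp(a, b, MAXW) > 0) {
                /* out of order: write them back swapped */
                fseek(f, i * MAXW, SEEK_SET);
                fwrite(b, 1, MAXW, f);
                fwrite(a, 1, MAXW, f);
                swapped = 1;
            }
        }
    }
    fclose(f);
    return 0;
}
```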

So, the auxiliary file is actually the fastest option.
