Home > front end >  Merging 1000s of files into one sorted file with memory constraints - heap vs bucket sort
Merging 1000s of files into one sorted file with memory constraints - heap vs bucket sort

Time:10-20

The numbers are unique within each file and across all files. Here is one popular approach (external sorting) to merge n files into one sorted file when our compute machine cannot store all the data:

  • Sort each smaller file
  • Read first elements from each file
  • Create a heap
  • while heap is not empty:
    • write min element to final file
    • insert next element from the smaller file (whose element was written to the main file in step above) into the heap

While I do understand how this algorithm helps in memory usage - there's a lot of I/O overhead involved.

Another possible approach uses bucket sort - specially if numbers are uniformly distributed:

  • Create n buckets - (max_num-min_number)/number_of_elements_in_1_bucket
  • For each smaller file:
    • read file
    • create a batch update dictionary {bucket_number:[list of elements]}
    • Append data from above dictionary to respective bucket files
    • The dictionary space is limited so only 100 elements at a time can be processed for each smaller file. So above 2 steps are repeated until we have exhausted the current file elements
  • For each bucket file created
    • sort data in each bucket
  • append all files together into a single sorted file

Now, I believe the latter approach can be slightly faster or at least equivalent in terms of time complexity, but uses more storage space for temp files. Are these 2 approaches comparable in time complexity or 1st one is always better even after multiple I/O calls? Am I missing some time-consuming step in the 2nd approach that makes it less optimum?

CodePudding user response:

My understanding is that it is the other way around. If numbers have a predictable distribution in a predictable range, bucket sort is faster. (In the limit O(n) vs O(n log(n)) for comparison based sorts.) Better yet, buckets can be sorted independently from each other. Which makes them perfect for a distributed sort.

  • Related