Sort the digits of a 1GB file containing a single number efficiently

Time:06-16

I'm trying to print in ascending order a 1GB file containing a randomly generated big number. This is the code that I'm using to generate the random number for my test (found it here).

import random
import math

afile = open("./Random.txt", "w" )

digits=1000000000
finalNumber = ""
for i in range(digits // 16):
  finalNumber = finalNumber + str(math.floor(random.random() * 10000000000000000))
finalNumber = finalNumber + str(math.floor(random.random() * (10 ** (digits % 16))))

afile.write(finalNumber)
afile.close()

The following Python code works and takes a bit less than 4 minutes. But I was told this can be accomplished in about 15 seconds, and that I might not even need to sort the number at all to print it in order, which is kind of puzzling.

This is not homework but a question I was asked in a job interview that I didn't manage to solve, and not knowing the solution is killing me. I wasn't asked to use any specific language, but I chose Python because I'm familiar with it. I gave bash a quick try, but my script was already struggling with a 10MB file.

# Sort integers in ascending order
import sys
import os
import shutil

# Count for the chunks
count = 0

# Generate 100MB chunks
def read_in_chunks(file_object, chunk_size=1024*102400):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

#Do a mergesort of the chunks.
def merge_files(outfile):
    words = []
    for f in os.listdir('./tmp/'):
        if os.path.isfile('./tmp/' + f):
            file_ = open('./tmp/' + f)
            words.append(file_.readlines())

    # Sort in-memory         
    words.sort()  

    with open(outfile, 'w') as out:
        out.write(str(words))

with open(sys.argv[1]) as line:
    # Recreate the tmp folder
    if os.path.exists('./tmp/'):
        shutil.rmtree('./tmp/')
    os.mkdir('./tmp/')
    for chunk in read_in_chunks(line):
        # Sort each chunk
        ascending = "".join(sorted(str(chunk)))
        # Write the sorted chunk to disk
        with open(f"./tmp/out_{count}.txt", mode="w") as fw:
            fw.writelines(ascending)
        count += 1
    #merge all chunks into a single file with mergesort    
    merge_files('finalout.txt')
    shutil.rmtree('./tmp/')

This basically splits the file into 100MB temporary chunks, sorts each chunk, and then merges them all with a merge sort. Just running a sort on the whole file ends in a "MemoryError".
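For reference, here is a hedged sketch of what the merge step is meant to do: `heapq.merge` lazily interleaves inputs that are each already sorted, yielding one globally sorted stream. The chunk strings below are made-up stand-ins for the sorted `./tmp/` files.

```python
import heapq

# Merge several individually sorted digit strings into one sorted string.
# heapq.merge streams its inputs rather than concatenating and re-sorting.
def merge_sorted_chunks(chunks):
    return "".join(heapq.merge(*chunks))

print(merge_sorted_chunks(["1359", "0248", "67"]))  # -> 0123456789
```

In the real script each chunk would be read from its temp file (or iterated lazily) instead of passed as an in-memory string.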

I also tried reading the file once with a for loop and using a series of if/else branches to append each digit to one of 10 different variables, then printing them in order from 0 to 9, but this is very inefficient and slower than my initial method.

Clearly there needs to be a "trick" to solve this problem.

CodePudding user response:

The possible values of the digits of a number are very limited: the ten digits 0 to 9. For that reason this problem is a perfect candidate for counting sort. Counting sort has complexity O(n), which is asymptotically faster than any comparison sort (like merge sort). This is also the best complexity you can achieve, since at the very least you need to read the number, which is already O(n), where n is the number of digits.
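A minimal counting-sort sketch for a string of digits (the input here is a made-up example, not the 1GB file):

```python
# Counting sort over the alphabet {0..9}: tally each digit in one O(n) pass,
# then emit the digits in ascending order, each repeated by its count.
def counting_sort_digits(s: str) -> str:
    counts = [0] * 10
    for ch in s:
        counts[int(ch)] += 1
    return "".join(str(d) * counts[d] for d in range(10))

print(counting_sort_digits("9180352"))  # -> 0123589
```

No comparison between elements ever happens, which is why the bound is O(n) rather than O(n log n).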

CodePudding user response:

As everyone indicates, the expected answer is a counting sort.

It takes a little extra effort, though, to make a Python-implemented counting sort beat the built-in sorted(), which is written in C. It's especially important to avoid creating a new Python string object for each character of data.

One solution is to use the built-in sorted(), followed by 10 calls to str.index() to locate where each digit's run starts and derive the counts for each chunk.
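Since str.index() only returns the first occurrence, deriving a count also needs the position where the run ends; here is a hedged sketch of the same idea using bisect on the sorted chunk (a variant for illustration, not the answerer's exact code):

```python
import bisect

# Sort the chunk once at C speed, then binary-search the run boundaries of
# each digit; the difference of the two boundaries is that digit's count.
def digit_counts_via_sort(chunk: str) -> dict:
    s = sorted(chunk)  # sorted list of characters; equal digits are contiguous
    counts = {}
    for d in "0123456789":
        lo = bisect.bisect_left(s, d)
        hi = bisect.bisect_right(s, d)
        counts[d] = hi - lo
    return counts
```

The Python-level work is only 20 binary searches per chunk, so almost all time is spent in the C-implemented sort.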

I decided to use 10 calls to str.count() instead. Here's the implementation:

from collections import defaultdict
counts=defaultdict(int)

with open("./Random.txt") as infile:
  while True:
    data = infile.read(1000000)
    if not data:
      break
    for digit in "0123456789":
      counts[digit] += data.count(digit)

with open("./fastout.txt", mode="w") as outfile:
  for digit in "0123456789":
    count = counts[digit]
    while count > 1000000:
      outfile.write(digit*1000000)
      count -= 1000000
    if count > 0:
      outfile.write(digit*count)

Your original results:

$ time python3 original.py

real    3m22.689s
user    3m10.143s
sys 0m9.797s

My results:

$ time python3 new.py

real    0m14.001s
user    0m13.297s
sys 0m0.471s

I also noticed that your output file is a little longer than the input file, so you have a bug in there somewhere that I didn't bother finding.
