I need to sort a large text file where records are separated by newline characters.
I have to assume that the input data is too big to fit into main memory, meaning I can only read and hold one line of the file in memory at a time. Because of that I can't build a list or array to feed into a classic sorting algorithm (like mergesort, quicksort, etc.), and that's where I'm stuck.
How could one approach this kind of problem?
CodePudding user response:
In practice, under normal circumstances, just use the Unix sort command. It already implements an external merge sort internally, spilling to temporary files when the data exceeds memory.
LC_ALL=C sort file_in.txt > file_out.txt
(Setting LC_ALL=C makes it compare raw byte values, which is faster and more predictable than locale-aware collation.)
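If the machine is short on memory or temp space, GNU sort also lets you tune the buffer size, the temp directory, and parallelism. These are GNU coreutils flags (BSD sort differs), and the scratch path here is just illustrative:
LC_ALL=C sort -S 2G --parallel=4 -T /mnt/scratch file_in.txt > file_out.txt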
For a very large file, I'd go distributed and use the sort built into a MapReduce. See How does the MapReduce sort algorithm work? for how that one works. One tip: if you go distributed, stay distributed. That is, the input file, the output file, and the operation itself should all live on distributed filesystems, so that no single machine becomes a bottleneck.
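As a rough sketch of what that looks like, here is one way to do it with PySpark, assuming you already have a Spark cluster and the input/output live on a distributed filesystem (the HDFS paths are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-sort").getOrCreate()

# Read the newline-delimited file as a one-column DataFrame ("value"),
# sort it cluster-wide, and write it back out. Keeping both paths on a
# distributed filesystem avoids funneling data through one machine.
lines = spark.read.text("hdfs:///data/file_in.txt")
lines.sort("value").write.text("hdfs:///data/file_out")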
Exactly once have I faced a situation where neither of those worked. I needed to sort and organize a dataset coming from a database, but the data was too big to sort in the database, and the machine I was on did not have space for the raw dataset.
I solved that one with a mergesort where all chunks above a certain size were kept in compressed files. The key logic was something like this:
chunks = []
for row in big_query:
    last_chunk = new chunk from row
    chunk = None
    # Merge the new chunk into existing chunks of similar size, so the
    # stack stays small and each row is merged only O(log n) times.
    while 0 < len(chunks):
        chunk = chunks.pop()
        if chunk.size < 1.2 * last_chunk.size:
            last_chunk = merge(chunk, last_chunk)
            chunk = None
        else:
            break
    if chunk is not None:
        chunks.append(chunk)  # put back the chunk that was too big to merge
    chunks.append(last_chunk)
# Finally, collapse whatever is left into one sorted chunk.
while 1 < len(chunks):
    chunks.append(merge(chunks.pop(), chunks.pop()))
The merge logic then handled the reasoning about whether a chunk should wind up in memory or in a compressed file, how to read its rows back, and how to write them; a sketch of that merge step follows below.
(My problem was not quite this simple, because I was also grouping, summing, and merging: essentially reproducing a report I couldn't run in the database because it didn't have enough working memory.)
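Python's standard library already has the two building blocks for that merge step: gzip for the compressed chunk files and heapq.merge for a k-way merge of sorted streams. A minimal sketch, assuming each chunk file is already sorted and newline-terminated (the function name and paths are illustrative):
import gzip
import heapq

def merge_compressed_chunks(chunk_paths, out_path):
    # Open each sorted, compressed chunk as a lazy line iterator.
    # heapq.merge interleaves them in sorted order while holding only
    # one line per chunk in memory at a time.
    files = [gzip.open(p, "rt", encoding="utf-8") for p in chunk_paths]
    try:
        with gzip.open(out_path, "wt", encoding="utf-8") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()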
Those three approaches cover every situation I've personally encountered with large files.
Here is a rough but working implementation of an external, file-based mergesort along those lines.
First you need a demo.txt that we want to sort:
hello
world
greetings
earthlings
And now Python code to print it in sorted order:
from tempfile import TemporaryFile

class FileBuffer:
    """A sorted run of lines stored in a temporary file.

    Starts in a 'writing' state; finish_writing() switches it to
    'reading', after which it hands back one line at a time.
    """
    def __init__(self):
        self.current_line = None
        self.fh = TemporaryFile()
        self.state = 'writing'
        self.size = 0

    def add(self, line):
        if self.state != 'writing':
            raise Exception(f"Cannot write to FileBuffer in state {self.state}")
        self.size += len(line)  # track total bytes written, not just the last line
        self.fh.write(line.encode('utf-8'))

    def finish_writing(self):
        self.fh.seek(0, 0)
        self.state = 'reading'
        self.fetch()
        return self.current_line

    def fetch(self):
        if self.state != 'reading':
            raise Exception(f"Cannot read from FileBuffer in state {self.state}")
        self.current_line = self.fh.readline().decode('utf-8')
        if self.current_line == '':
            self.current_line = None
            self.state = 'done'
        return self.current_line

    def __iter__(self):
        return self

    def __next__(self):
        line = self.current_line
        if line is None:
            raise StopIteration
        self.fetch()
        return line
class BufferedSort:
    """Streaming external mergesort over a stack of FileBuffers."""
    def __init__(self):
        self.buffers = []

    def merge(self):
        # Standard two-way merge of the top two sorted buffers.
        buffer1 = self.buffers.pop()
        buffer2 = self.buffers.pop()
        new_buffer = FileBuffer()
        while buffer1.current_line is not None and buffer2.current_line is not None:
            if buffer1.current_line < buffer2.current_line:
                new_buffer.add(buffer1.current_line)
                buffer1.fetch()
            else:
                new_buffer.add(buffer2.current_line)
                buffer2.fetch()
        # Drain whichever buffer still has lines left.
        while buffer1.current_line is not None:
            new_buffer.add(buffer1.current_line)
            buffer1.fetch()
        while buffer2.current_line is not None:
            new_buffer.add(buffer2.current_line)
            buffer2.fetch()
        new_buffer.finish_writing()
        self.buffers.append(new_buffer)

    def add(self, line):
        buffer = FileBuffer()
        buffer.add(line)
        buffer.finish_writing()
        self.buffers.append(buffer)
        # Merge neighbors of similar size; the 1.2 ratio keeps the stack
        # short so each line is merged only O(log n) times overall.
        while 1 < len(self.buffers) and self.buffers[-2].size < 1.2 * self.buffers[-1].size:
            self.merge()

    def finish_writing(self):
        # Collapse everything that is left into a single sorted buffer.
        while 1 < len(self.buffers):
            self.merge()

    def sorted_buffer(self):
        self.finish_writing()
        if len(self.buffers):
            return self.buffers[0]
        # No input at all: return an empty, already-exhausted buffer.
        buffer = FileBuffer()
        buffer.state = 'done'
        return buffer

    def __iter__(self):
        return self.sorted_buffer()
sorter = BufferedSort()
with open("demo.txt") as fh:
    for line in fh:
        # Ensure the final line ends with a newline; otherwise two
        # records could run together during a merge.
        if not line.endswith("\n"):
            line += "\n"
        sorter.add(line)

for line in sorter:
    print(line, end="")
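Running this against the demo.txt above prints the lines in sorted order:
earthlings
greetings
hello
world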