Python Multiprocessing read and write to large file


I have a large file of 120 GB consisting of strings, one per line. I would like to loop over the file line by line, replacing every occurrence of the German character ß with s. I have working code, but it is very slow, and in the future I will need to replace more German characters. So I have been trying to cut the file into 6 pieces (for my 6-core CPU) and use multiprocessing to speed the code up, but with no luck.

As the lines are not ordered, I do not care where each line ends up in the new file. Can somebody please help me?

My working slow code:

import re

with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f, \
        open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
    for l in f:
        l = re.sub("ß", "s", l)
        nf.write(l)

CodePudding user response:

For a multiprocessing solution to be more performant than the equivalent single-processing one, the worker function must be sufficiently CPU-intensive such that running the function in parallel saves enough time to compensate for the additional overhead that multiprocessing incurs.
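To see this trade-off concretely, here is a small self-contained benchmark (not part of the original answer; the sample data and the pool size of 4 are arbitrary assumptions). It compares plain single-process replacement with shipping each line to a worker process individually; the per-line version is typically much slower because almost all of its time goes into inter-process communication rather than useful work.

# Not part of the original answer: a quick way to measure why per-line tasks
# do not pay off. It times plain single-process replacement against handing
# each line to a pool one at a time, so nearly all of the parallel version's
# time is inter-process overhead.
import time
from multiprocessing import Pool

def per_line(line):
    return line.replace("ß", "s")

if __name__ == '__main__':
    lines = ["Musterstraße 42, Gießen\n"] * 100_000  # made-up sample data

    start = time.perf_counter()
    _ = [per_line(line) for line in lines]
    single = time.perf_counter() - start

    with Pool(4) as pool:
        start = time.perf_counter()
        _ = list(pool.imap_unordered(per_line, lines))
        parallel = time.perf_counter() - start

    print(f"single process: {single:.3f}s  per-line pool: {parallel:.3f}s")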

To make the worker function sufficiently CPU-intensive, I would batch up the lines to be translated into chunks so that each invocation of the worker function does more CPU work. You can play around with the CHUNK_SIZE value (read the comments that precede its definition). If you have sufficient memory, the larger the better.

from multiprocessing import Pool, cpu_count

def get_chunks():
    # If you have N processors,
    # then we need memory to hold 2 * (N - 1) chunks (one processor
    # is reserved for the main process).
    # The size of a chunk is CHUNK_SIZE * average-line-length.
    # If the average line length were 100, then a chunk would require
    # approximately 1_000_000 bytes of memory.
    # So if you had, for example, a 16MB machine with 8 processors,
    # you would have more
    # than enough memory for this CHUNK_SIZE.
    CHUNK_SIZE = 1_000

    with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def worker(chunk):
    # This function must be sufficiently CPU-intensive
    # to justify multiprocessing.
    for idx in range(len(chunk)):
        chunk[idx] = chunk[idx].replace("ß", "s")
    return chunk

def main():
    with Pool(cpu_count() - 1) as pool, \
            open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
        for chunk in pool.imap_unordered(worker, get_chunks()):
            nf.write(''.join(chunk))
            # Or, to be more memory-efficient but slower:
            # for line in chunk:
            #     nf.write(line)

if __name__ == '__main__':
    main()
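
Since the question mentions that more German characters will need replacing later, the worker could be extended with a single str.translate table so each line is rewritten in one pass instead of chained replace calls. This is only a sketch; the ä/ö/ü mappings are illustrative assumptions, not something the question asked for.

# Sketch: replace several German characters in one pass with a translation
# table. Only ß -> s comes from the question; the other mappings are
# illustrative assumptions.
GERMAN_MAP = str.maketrans({
    "ß": "s",
    "ä": "ae",
    "ö": "oe",
    "ü": "ue",
})

def worker(chunk):
    # Still one CPU-heavy call per chunk, as in the answer's code.
    return [line.translate(GERMAN_MAP) for line in chunk]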