I have a large file (120 GB) consisting of strings, one per line. I would like to loop over the file line by line, replacing every occurrence of the German character ß with s. I have working code, but it is very slow, and in the future I will have to replace more German characters as well. So I have been trying to cut the file into 6 pieces (one for each core of my 6-core CPU) and process them with multiprocessing to speed things up, but with no luck.
Since the lines are not ordered, I do not care where the lines in the new file end up. Can somebody please help me?
My working, but slow, code:
import re

# Raw strings so the backslashes in the Windows paths are not
# treated as escape sequences.
with open(r'C:\Projects\orders.txt', 'r') as f, \
        open(r'C:\Projects\orders_new.txt', 'w') as nf:
    for l in f:
        l = re.sub("ß", "s", l)
        nf.write(l)
CodePudding user response:
For a multiprocessing solution to be more performant than the equivalent single-process one, the worker function must be sufficiently CPU-intensive that running it in parallel saves more time than the multiprocessing overhead costs (chiefly the pickling and inter-process transfer of arguments and results). Replacing a character in a single line is far too little work to submit as its own task.
To make the worker function sufficiently CPU-intensive, I would batch the lines to be translated into chunks, so that each invocation of the worker processes many lines at once. You can play around with the CHUNK_SIZE value (read the comment that precedes its definition). If you have sufficient memory, the larger the better.
from multiprocessing import Pool, cpu_count

def get_chunks():
    # If you have N processors, then we need memory to hold
    # 2 * (N - 1) chunks (one processor is reserved for the
    # main process).
    # The size of a chunk is CHUNK_SIZE * average-line-length.
    # If the average line length were 100 characters, then a chunk
    # would require approximately 100_000 bytes of memory.
    # So if you had, for example, a 16GB machine with 8 processors,
    # you would have more than enough memory for this CHUNK_SIZE.
    CHUNK_SIZE = 1_000

    with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f:
        chunk = []
        while True:
            line = f.readline()
            if line == '':  # end of file
                break
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def worker(chunk):
    # This function must be sufficiently CPU-intensive
    # to justify multiprocessing.
    for idx in range(len(chunk)):
        chunk[idx] = chunk[idx].replace("ß", "s")
    return chunk

def main():
    with Pool(cpu_count() - 1) as pool, \
            open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
        for chunk in pool.imap_unordered(worker, get_chunks()):
            nf.write(''.join(chunk))
            # Or, to be more memory efficient but slower:
            # for line in chunk:
            #     nf.write(line)

if __name__ == '__main__':
    main()
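Since you say you will need to replace more German characters later: rather than chaining several replace calls inside the worker, you could build a translation table once with str.maketrans and apply it with str.translate, which performs all the mappings in a single pass over each line. Below is a minimal sketch of a drop-in replacement for the worker above; the extra mappings (ä → ae, etc.) are illustrative assumptions that you would adjust to your actual needs. Note that a table value may be longer than one character, so mapping ß → ss would also work if you ever want that.

# The mappings below are assumptions for illustration only --
# edit the table to match the characters you actually need.
TRANSLATION = str.maketrans({
    'ß': 's',
    'ä': 'ae',
    'ö': 'oe',
    'ü': 'ue',
})

def worker(chunk):
    # str.translate applies every mapping in one pass per line.
    return [line.translate(TRANSLATION) for line in chunk]

The rest of the program stays the same. As an aside, Pool.imap_unordered also accepts a chunksize argument that batches the dispatch of tasks transparently, which amortizes much of the same per-task overhead as the manual chunking above.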