Python - Improve performance on reading flat file line-by-line

I have a large .txt file that I want to read one line at a time (rather than loading it all into memory, to avoid out-of-memory issues), extracting all unique characters present in the file. The code below works well for small files, but on the large files I typically need to process it runs extremely slowly, e.g. around 1 hour for a 10 GB file. Can someone please suggest how I can improve the performance, for example by re-arranging the operations being performed, avoiding duplicate work, or using faster functions?

Thanks

def flatten(t):
    '''Flatten a list of lists'''
    return [item for sublist in t for item in sublist]

input_file = r'C:\large_text_file.txt'
output_file = r'C:\char_set.txt'

#Parameters
case_sensitive = False
remove_crlf = True

#Extract all unique characters from file
charset = []
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            charset.append(list(line.rstrip())) #remove CRLF
        else:
            charset.append(list(line))
        
        charset = flatten(charset) #flatten the list of lists

        if not(case_sensitive):
            charset = map(lambda x: x.upper(), charset) #convert to upper case

        charset = list(dict.fromkeys(charset)) #remove duplicates

charset.sort() #sort character set in ascending order

#Output the character set
with open(output_file, 'w') as f:
    for char in charset:
        f.write(char)

CodePudding user response:

You can simplify that considerably, so that each line of the file is processed exactly once:

charset = set()  # use a real set!
with open(input_file, 'r') as infile:
    for line in infile:
        if remove_crlf:
            line = line.rstrip()  # rstrip (not strip), so leading whitespace still counts
        if not case_sensitive:
            line = line.upper()
        charset.update(line)

with open(output_file, 'w') as f:
    for char in sorted(charset):
        f.write(char)
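
If you need to squeeze a bit more speed out of very large files, one further option (not from the answer above; a sketch assuming a plain ASCII/UTF-8 text file) is to read fixed-size chunks instead of lines, which removes the per-line overhead. Note that discarding only '\r' and '\n' afterwards is slightly different from rstrip(), which also drops other trailing whitespace:

charset = set()
with open(input_file, 'r') as infile:
    while True:
        chunk = infile.read(1 << 20)  # read about a million characters at a time
        if not chunk:
            break
        if not case_sensitive:
            chunk = chunk.upper()
        charset.update(chunk)

if remove_crlf:
    charset -= {'\r', '\n'}  # drop the line terminators afterwards

with open(output_file, 'w') as f:
    f.write(''.join(sorted(charset)))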

What made the original so slow were all these lines, which run on every single line of the file:

charset = flatten(charset) #flatten the list of lists
charset = map(lambda x: x.upper(), charset)
charset = list(dict.fromkeys(charset))

where each pass re-flattens, re-uppercases and re-deduplicates the entire accumulated character set and builds several temporary lists, instead of only touching the characters of the current line.
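
To see the difference concretely, here is a small self-contained benchmark sketch (synthetic data; the line length and line count are made up, and actual timings will vary by machine):

import random
import string
import time

lines = [''.join(random.choices(string.ascii_letters, k=80))
         for _ in range(100_000)]

# Original pattern: flatten, upper-case and de-duplicate the whole
# accumulated character set again on every line.
t0 = time.perf_counter()
charset = []
for line in lines:
    charset.append(list(line))
    charset = [item for sublist in charset for item in sublist]
    charset = list(map(lambda x: x.upper(), charset))
    charset = list(dict.fromkeys(charset))
print('rebuild per line:', time.perf_counter() - t0)

# Set-based pattern: one cheap update per line.
t0 = time.perf_counter()
charset = set()
for line in lines:
    charset.update(line.upper())
print('set.update:', time.perf_counter() - t0)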
