Time to process a file is very long without import


My code is as follows:

import re

def get_filename():
    """gets the file"""
    filename = input("Please enter filename: ")
    return filename
    
def get_words_from_file(filename):
    """getting the data and printing it word by word"""
    infile = open(filename, 'r', encoding='utf-8')
    outfile = infile.read().splitlines()
    words = []
    reading = False
    for let in outfile:
        if let.startswith("*** START OF") and reading == False:
            reading = True
        elif let.startswith("*** END OF SYNTHETIC TEST CASE ***") or let.startswith("*** END"):
            return words
        elif reading:
            let = let.lower()
            words.extend(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", let))
    return words

def calculate(words):
    """gjhwjghwg2"""
    all_times = []
    max_word_length = 0
    number_of_words = len(words)
    average = sum(len(word) for word in words) / number_of_words
    for word in words:
        if len(word) > max_word_length:
            max_word_length = len(word)
    for word in words:
        total = words.count(word)
        all_times.append(total)
    max_frequency = max(all_times)
    
    result = (number_of_words, average, max_word_length, max_frequency)
    return result

def print_results(stats_tuple):
    """calculate the goods"""
    (number_of_words, average, max_word_length, max_frequency) = stats_tuple
    print("")
    print("Word summary (all words):")
    print(" Number of words = {0}".format(number_of_words))
    print(" Average word length = {:.2f}".format(average))
    print(" Maximum word length = {0}".format(max_word_length))
    print(" Maximum frequency = {0}".format(max_frequency))

def main():
    """ghkghwgjkwhgw"""
    filename = get_filename()
    data = get_words_from_file(filename)
    stats = calculate(data)
    print_results(stats)
main()

I have a text file that is very large, so when I try to run the program it takes a very long time. Just wondering if there is something I need to change so that it does not take as long. The code works fine elsewhere, but this text file has 75,000 words.

CodePudding user response:

From what I see I would assume that

    for word in words:
        total = words.count(word)
        all_times.append(total)

is the problem, because its runtime is O(len(words)**2). What about changing this to

    frequency = {word: 0 for word in words}
    for word in words:
        frequency[word] += 1
    max_frequency = max(frequency.values())

Note: I did not test this code.
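
For what it's worth, the standard library's collections.Counter does the same counting in a single pass; a quick sketch (also untested):

    from collections import Counter

    # Counter builds the same word -> count mapping in one pass over the list
    frequency = Counter(words)
    max_frequency = max(frequency.values())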

CodePudding user response:

In get_words_from_file (a combined sketch follows the list):

  • do not read the whole file and then split it into lines; just iterate over the file object line by line
  • compile your regex pattern once and reuse it
  • do you really need that lower() call?
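
Putting those three points together, a rough (untested) sketch of get_words_from_file; the + quantifiers in the pattern are assumed from the original regex, and lower() is kept only because the pattern matches lowercase letters:

    import re

    # compiled once at module level instead of on every findall() call
    WORD_RE = re.compile("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

    def get_words_from_file(filename):
        """Return the words between the START and END markers."""
        words = []
        reading = False
        with open(filename, 'r', encoding='utf-8') as infile:
            for line in infile:                      # iterate line by line, no read()/splitlines()
                if line.startswith("*** START OF") and not reading:
                    reading = True
                elif line.startswith("*** END"):
                    break
                elif reading:
                    # drop lower() here if the statistics should be case-sensitive
                    words.extend(WORD_RE.findall(line.lower()))
        return words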

CodePudding user response:

You have a text file with N words, and you are iterating over it five times:

  1. get_words_from_file
  2. average = sum(len(word) for word in words) / number_of_words
  3. the maximum-length loop:

    for word in words:
        if len(word) > max_word_length:
            max_word_length = len(word)

  4. the frequency loop:

    for word in words:
        total = words.count(word)  # and words.count() is itself a full pass over the list, the fifth one
        all_times.append(total)

All in all, your time complexity is 2N + N^2, which is O(N^2).

You can save a lot of time by doing only two iterations. In the first pass over the words, build a dictionary mapping each word to its number of appearances, i.e. a dict[str, int]. The second pass then calculates all the other measures.

In the worst case (if all the words are different), the time complexity will be only 2N.

Most of the time it will be even faster, because the second pass only has to visit the distinct words, and repeated words make that set much smaller.
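
A sketch of that two-pass calculate, returning the same stats tuple as in the question (untested; collections.Counter would also work for the counting):

    def calculate(words):
        """Count every word once, then derive the other measures from the distinct words."""
        frequency = {}                            # dict[str, int]: word -> number of appearances
        for word in words:                        # first pass: N steps
            frequency[word] = frequency.get(word, 0) + 1

        number_of_words = len(words)
        total_length = 0
        max_word_length = 0
        for word, count in frequency.items():     # second pass: distinct words only
            total_length += len(word) * count
            max_word_length = max(max_word_length, len(word))

        average = total_length / number_of_words
        max_frequency = max(frequency.values())
        return (number_of_words, average, max_word_length, max_frequency)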
