My code is as follows:
import re

def get_filename():
    """Ask the user for a filename and return it."""
    filename = input("Please enter filename: ")
    return filename

def get_words_from_file(filename):
    """Return the words found between the START and END markers."""
    infile = open(filename, 'r', encoding='utf-8')
    outfile = infile.read().splitlines()
    words = []
    reading = False
    for let in outfile:
        if let.startswith("*** START OF") and reading == False:
            reading = True
        elif let.startswith("*** END OF SYNTHETIC TEST CASE ***") or let.startswith("*** END"):
            return words
        elif reading:
            let = let.lower()
            words.extend(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", let))
    return words
def calculate(words):
    """Compute summary statistics for the word list."""
    all_times = []
    max_word_length = 0
    number_of_words = len(words)
    average = sum(len(word) for word in words) / number_of_words
    for word in words:
        if len(word) > max_word_length:
            max_word_length = len(word)
    for word in words:
        total = words.count(word)
        all_times.append(total)
    max_frequency = max(all_times)
    result = (number_of_words, average, max_word_length, max_frequency)
    return result
def print_results(stats_tuple):
    """Print the word summary."""
    (number_of_words, average, max_word_length, max_frequency) = stats_tuple
    print("")
    print("Word summary (all words):")
    print(" Number of words = {0}".format(number_of_words))
    print(" Average word length = {:.2f}".format(average))
    print(" Maximum word length = {0}".format(max_word_length))
    print(" Maximum frequency = {0}".format(max_frequency))
def main():
    """Read the file, compute the statistics, and print them."""
    filename = get_filename()
    data = get_words_from_file(filename)
    stats = calculate(data)
    print_results(stats)

main()
I have a text file that is very large, so when I try to run it, it takes a very long time. Just wondering if there is something I need to change so it doesn't take as long. The code works fine elsewhere, but this text file has 75,000 words.
CodePudding user response:
From what I see I would assume that
for word in words:
    total = words.count(word)
    all_times.append(total)
is the problem, because its runtime is O(len(words)**2). What about changing this to
frequency = {word: 0 for word in words}
for word in words:
    frequency[word] += 1
max_frequency = max(frequency.values())
Note: I did not test this code.
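For what it's worth, the standard library already has this counting pattern built in: collections.Counter builds the same word -> count mapping in a single pass. A minimal sketch (equally untested against your file):

from collections import Counter

# one O(N) pass over the words builds the word -> count mapping
frequency = Counter(words)
max_frequency = max(frequency.values())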
CodePudding user response:
In get_words_from_file:
- do not read the whole file and then split the lines - just iterate over the lines
- compile your regex pattern once and reuse it
- do you really need that lower() call?
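Something along these lines - a sketch, assuming the same START/END marker conventions as your file (untested):

import re

def get_words_from_file(filename):
    """Return the lowercase words between the START and END markers."""
    pattern = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")  # compiled once
    words = []
    reading = False
    with open(filename, 'r', encoding='utf-8') as infile:
        for line in infile:  # lazy line iteration, no full read()
            if not reading and line.startswith("*** START OF"):
                reading = True
            elif line.startswith("*** END"):
                break
            elif reading:
                # drop the lower() call if your file is already lowercase
                words.extend(pattern.findall(line.lower()))
    return words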
CodePudding user response:
You have a text file with N words. You are iterating over it 5 times:
1) get_words_from_file
2) average = sum(len(word) for word in words) / number_of_words
3) for word in words:
       if len(word) > max_word_length:
           max_word_length = len(word)
4) for word in words:
       total = words.count(word)  # here is the fifth time
       all_times.append(total)
All in all, your time complexity is 2N + N^2, i.e. O(N^2).
You can save a lot of time by doing only two iterations. In the first iteration over the words, build a dictionary (dict[str, int]) mapping key = the word to value = number of appearances.
The second iteration then calculates all the other measures.
In the worst case (if all the words are different), the time complexity will be only 2N.
Most of the time, it will be much faster because of all the word repetitions.
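A sketch of calculate rewritten along those lines - one pass over the words, one pass over the distinct counts (names kept from your code, untested):

def calculate(words):
    """Compute (count, average length, max length, max frequency) in two passes."""
    frequency = {}
    total_length = 0
    max_word_length = 0
    for word in words:  # first pass: counts plus length statistics
        frequency[word] = frequency.get(word, 0) + 1
        total_length += len(word)
        if len(word) > max_word_length:
            max_word_length = len(word)
    number_of_words = len(words)
    average = total_length / number_of_words
    max_frequency = max(frequency.values())  # second pass, over distinct words only
    return (number_of_words, average, max_word_length, max_frequency)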