Only returning word counts for words >= 5 characters & sort by key value (highest to lowest)

I have a .txt file and want to count how many times each word appears in it. I got the code to work, but now I want to return only words that are 5 or more characters long. I added a len() check to the if statement inside the for loop, but it still returns all words. Any help would be greatly appreciated.

I am also wondering whether I can sort by count, so that the words with the highest counts are returned first.

import string
import os

os.chdir('mydirectory') # Changes directory.

speech = open("obamaspeech.txt", "r") # Opens file.
  
emptyDict = dict() # Creates dictionary

for line in speech:
    line = line.strip() # Removes leading and trailing whitespace.
    line = line.lower() # Convert to lowercase.
    line = line.translate(line.maketrans("", "", string.punctuation)) # Removes punctuation.
    words = line.split(" ") # Splits lines into words. 
    for word in words:
        if len(word) >= 5 in emptyDict: 
            emptyDict[word] = emptyDict[word] + 1
        else:
            emptyDict[word] = 1
  
for key in list(emptyDict.keys()):
    print(key, ":", emptyDict[key])

CodePudding user response:

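I think you need a separate test for word length. The condition len(word) >= 5 in emptyDict is a chained comparison, which Python reads as len(word) >= 5 and 5 in emptyDict. Since 5 is never a key, the test is always false and every word is simply set to 1. Nest the two tests instead: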

for word in words:
    if len(word) >= 5:
        if word in emptyDict: 
            emptyDict[word] = emptyDict[word] + 1
        else:
            emptyDict[word] = 1
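
For the second part of the question, a plain dictionary can be sorted by count with sorted() and a key function. The following is a minimal sketch rather than a drop-in replacement: the helper name count_long_words and the sample_text string are made up for illustration and stand in for the contents of obamaspeech.txt.

```python
import string

def count_long_words(text, min_len=5):
    """Count words of at least min_len characters (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        line = line.strip().lower()
        line = line.translate(str.maketrans("", "", string.punctuation))
        for word in line.split():
            if len(word) >= min_len:      # Length test first...
                if word in counts:        # ...then the membership test.
                    counts[word] = counts[word] + 1
                else:
                    counts[word] = 1
    return counts

# Illustrative sample text, not taken from the speech file.
sample_text = "Change will not come. Change comes from people, people, PEOPLE!"
counts = count_long_words(sample_text)

# Sort the (word, count) pairs by count, highest first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, n in ranked:
    print(word, ":", n)
```

Using split() with no argument, instead of split(" "), also avoids counting empty strings when a line contains consecutive spaces.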

CodePudding user response:

Another answer already shows how to fix your code. Here is an alternative implementation: counting words and sorting them by frequency is much easier with a list comprehension and the Counter class from the collections module.

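import os
import string
from collections import Counter

os.chdir('mydirectory')
with open("obamaspeech.txt", "r") as speech:
    # Read the whole file, lowercase it, and strip punctuation.
    full_speech = speech.read().lower().translate(str.maketrans("", "", string.punctuation))

words = full_speech.split()  # Splits on any whitespace, including newlines.
count = Counter([w for w in words if len(w) >= 5])  # Keeps only words of 5+ characters.
for w, k in count.most_common():  # Iterates from highest count to lowest.
    print(f"{w}: {k} time(s)")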
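
To try this approach without the file on disk, the same pipeline can be run on an in-memory string. This is only a sketch: the helper name top_words and the sample string are invented for illustration. Passing n to most_common() limits the output to the n most frequent words.

```python
import string
from collections import Counter

def top_words(text, min_len=5, n=None):
    """Return (word, count) pairs, most frequent first (hypothetical helper)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    counter = Counter(w for w in cleaned.split() if len(w) >= min_len)
    return counter.most_common(n)  # n=None returns every counted word.

# Illustrative sample text, not taken from the speech file.
sample = "Yes we can change. Change, change is never easy, never simple."
for word, k in top_words(sample, n=2):
    print(f"{word}: {k} time(s)")
```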