Word Frequency Function Not Adding Single Occurrences to Count

I've got some code my professor and I went over because idfk what I'm doing anymore. Our project was to write a script that would count how many times words appeared in a text file. He was really helpful walking me through the code and explaining it, but when we ran it, the script wasn't accounting for words that had only appeared once. How do I make it so that the words that only appear once have a count of 1, while the rest of the words keep their same count?

ie what we got:

the: 3
history: 0
learning: 4

vs what we needed:

the: 3
history: 1
learning: 4

My most obvious answer is to simply do count = 1, but that also bumps up the rest of the numbers. I'm assuming this has something to do with elif statements?

Here's the code:

def get_file_name(title):

    file_name = input(title)

    return file_name

def read_file_contents(file_name):

    #The purpose of this function is to read the contents of the
    #file that contains the words

    #First thing that we need to do is to create a Python list to
    #hold the words that are in the file

    word_list = []

    #We must always assume that the user can make a typo when
    #entering a file name. Having a program
    #"crash" is definitely not user-friendly, so we encase the
    #reading of the file within a try/except block
    #so that if an error on opening the file occurs, we can
    #"end our program" very gracefully with
    #an error message

    try:

       with open(file_name,'r') as input_file:

           #we will read the file, one line at a time,
           #until we have read all of the lines in the file.
           #As we read each line, we will first remove the
           #"new line character" that is at the end  of each
           #line. Then, we will split the line into words
           #(words are preceded and followed by spaces), and
           #then append each word to our list of words

           for line in input_file:

               line = line.rstrip()

               words = line.split(' ')

               for word in words:

                   word_list.append(word)

       

    except:

        # If the file cannot be found,
        #then an "error message" is printed and it quits the program

        print("A major error has occurred!")

        quit()

    #Now, we need to return the list of words so
    #that we can do the counting.

    return word_list  


def establish_word_frequency(word_list):

    word_list.sort()

    #create a list that will contain the words and their count.
    #This list will hold strings.
    #set a count variable to 0

    frequency = []

    count = 0

    #set a previous word variable to be the empty string

    prev_word = ''

    #For each word in the word_list

    for i in range(len(word_list)):

        #see if the current word is the same as the previous word

        if word_list[i] == prev_word:

          #If it is, add one to the current count

            count += 1

        #otherwise (meaning the word is different)

        else:

            #concatenate the current word with a
            #space, colon, space and the string equivalent of the
            #integer count (use str() for this)
            #Append the word and its count to the frequency list

            frequency.append(word_list[i] + ' : ' + str(count))

            #set the count back to zero

            count = 0

            prev_word = word_list[i]

    #At this point, all of the words have been counted and
    #appended to the list
    #return the sorted list

    frequency.sort()

    return frequency



def write_word_list(file_name, word_list):

    with open(file_name, 'w') as out_file:

        #write each element of the word_list to the file

        for word in word_list:

            out_file.write(f'{word}\n')

def main():

    #get the filename by calling get_file_name
    #and put the value into a variable

    name_of_file = get_file_name('Which file do you want to analyze? ')

    #Pass the file name variable to the read_file_contents
    #function and put the result into a variable
    #that will contain the file contents

    file_contents = read_file_contents(name_of_file)

    #Pass the file contents list to the
    #establish_word_frequency function
    #and put the result into a variable

    results = establish_word_frequency(file_contents)

    #get the filename by calling get_file_name
    #file and put the value into a variable

    output_file = get_file_name('What is the name of the output file? ')                  

    #Pass the file name and the word frequency list to
    #the write_word_list function

    write_word_list(output_file, results)

   

if __name__ == '__main__':

    main()

here's a snippet of words.txt:

the
college
learning
the
process
and
history
of
papermaking
participating
the
workshop
the
history

CodePudding user response：

Instead of debugging your code, please allow me to suggest a different approach.

Sorting the list of words in order to count them is inefficient. Sorting takes O(n log n) time, where n is the number of items in the list (i.e., its length).

On the other hand, if you just iterate over the list and count the words using a dictionary (which is basically a hash table, so each lookup or update takes O(1) time on average), you reduce the total time complexity to just O(n).

Here is the code:

def establish_word_frequency(word_list):

    freq = {}

    for word in word_list:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1

    return freq

I think this is nice and simple, but you can also just use Counter from the standard library's collections module, a dictionary subclass that does this counting for you:

from collections import Counter

def establish_word_frequency(word_list):
    return Counter(word_list)
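
Note that write_word_list in the question writes one string per line, so a dictionary or Counter would have to be turned back into 'word : count' strings before it is passed along. A minimal sketch of that step, assuming the ' : ' separator from the question's code; the helper name format_frequency is hypothetical and not part of the original program:

```python
from collections import Counter


def format_frequency(word_list):
    # Hypothetical helper (not in the original post): count the words,
    # then build the 'word : count' strings that write_word_list expects.
    counts = Counter(word_list)
    # sorted() on a Counter iterates its keys, giving alphabetical order.
    return [word + ' : ' + str(counts[word]) for word in sorted(counts)]


sample = ['the', 'history', 'learning', 'the', 'learning', 'the']
print(format_frequency(sample))
# prints ['history : 1', 'learning : 2', 'the : 3']
```

A word that appears only once gets a count of 1 here, because the count is stored the first time the word is seen.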

CodePudding user response：

The solution provided by @Orius is more efficient and readable, but since you're not supposed to use dictionaries, I'm making some corrections to your version of the code.

def get_file_name(title):

    file_name = input(title)

    return file_name

def read_file_contents(file_name):

    #The purpose of this function is to read the contents of the
    #file that contains the words

    #First thing that we need to do is to create a Python list to
    #hold the words that are in the file

    word_list = []

    #We must always assume that the user can make a typo when
    #entering a file name. Having a program
    #"crash" is definitely not user-friendly, so we encase the
    #reading of the file within a try/except block
    #so that if an error on opening the file occurs, we can
    #"end our program" very gracefully with
    #an error message

    try:

       with open(file_name,'r') as input_file:

           #we will read the file, one line at a time,
           #until we have read all of the lines in the file.
           #As we read each line, we will first remove the
           #"new line character" that is at the end  of each
           #line. Then, we will split the line into words
           #(words are preceded and followed by spaces), and
           #then append each word to our list of words

           for line in input_file:

               line = line.rstrip()

               words = line.split(' ')

               for word in words:

                   word_list.append(word)

       

    except:

        # If the file cannot be found,
        #then an "error message" is printed and it quits the program

        print("A major error has occurred!")

        quit()

    #Now, we need to return the list of words so
    #that we can do the counting.

    return word_list  


def establish_word_frequency(word_list):

    word_list.sort()
    print(word_list)

    #create a list that will contain the words and their count.
    #This list will hold strings.
    #set a count variable to 1

    frequency = []

    count = 1  #initialise count with 1 since least count of every word is 1.

    #set a current word variable to None so that it never
    #matches a real word before the first word has been seen

    cur_word = None

    #For each word in the word_list

    for i in range(len(word_list)):

        #see if this word is the same as the word being counted

        if word_list[i] == cur_word:

            #If it is, add one to the current count

            count += 1

        #otherwise (meaning the word is different)

        else:

            #concatenate the word that was being counted with a
            #space, colon, space and the string equivalent of the
            #integer count (use str() for this)
            #Append it to the frequency list, except on the very
            #first word, when nothing has been counted yet

            if cur_word is not None:

                frequency.append(cur_word + ' : ' + str(count))

            #set the count back to 1

            count = 1

            cur_word = word_list[i]   #start counting the new word

    #The last word is never followed by a different word, so
    #append it and its count after the loop ends

    if cur_word is not None:

        frequency.append(cur_word + ' : ' + str(count))

    #At this point, all of the words have been counted and
    #appended to the list
    #return the sorted list

    frequency.sort()

    return frequency



def write_word_list(file_name, word_list):

    with open(file_name, 'w') as out_file:

        #write each element of the word_list to the file

        for word in word_list:

            out_file.write(f'{word}\n')

def main():

    #get the filename by calling get_file_name
    #and put the value into a variable

    name_of_file = get_file_name('Which file do you want to analyze? ')

    #Pass the file name variable to the read_file_contents
    #function and put the result into a variable
    #that will contain the file contents

    file_contents = read_file_contents(name_of_file)

    #Pass the file contents list to the
    #establish_word_frequency function
    #and put the result into a variable

    results = establish_word_frequency(file_contents)

    #get the filename by calling get_file_name
    #file and put the value into a variable

    output_file = get_file_name('What is the name of the output file? ')                  

    #Pass the file name and the word frequency list to
    #the write_word_list function

    write_word_list(output_file, results)

   

if __name__ == '__main__':

    main()

There were some mistakes in your code:

count was initialised with 0, which is wrong because the count for every word is at least 1, not 0.
The word and its count were paired incorrectly where you append to frequency: when a different word is seen, count belongs to the previous word, not the new one.

I suggest you rewrite your own code once you understand where you made the mistakes.
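
To check the logic without any files, here is a compact sketch of the same sort-and-compare idea, run on the words.txt snippet from the question. The function name count_sorted_runs is hypothetical. It uses no dictionaries, and it appends the final word after the loop, since the last word is never followed by a different one.

```python
def count_sorted_runs(word_list):
    # Hypothetical name; same sort-and-compare idea, no dictionaries.
    words = sorted(word_list)
    frequency = []
    count = 0
    prev_word = None  # None never equals a real word
    for word in words:
        if word == prev_word:
            count += 1
        else:
            # A new word starts: record the finished previous word first.
            if prev_word is not None:
                frequency.append(prev_word + ' : ' + str(count))
            prev_word = word
            count = 1
    # The last word is never followed by a different one, so flush it here.
    if prev_word is not None:
        frequency.append(prev_word + ' : ' + str(count))
    return frequency


snippet = ['the', 'college', 'learning', 'the', 'process', 'and', 'history',
           'of', 'papermaking', 'participating', 'the', 'workshop', 'the',
           'history']
for entry in count_sorted_runs(snippet):
    print(entry)
```

On this snippet, words that occur once (such as 'college') come out with a count of 1, 'history' with 2, and 'the' with 4.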