Text Comparing Program

I'm making a program that is supposed to compare text files by returning a list of all the words that appear in each file and the number of times each one appears. I have to disregard a list of words called stopwords, so those are never counted. For the first part, I need to check whether each word is in the stopwords. If it is, I don't count it. If it isn't, I add a new row for that word to a DataFrame (assuming it doesn't already have one) and increment its frequency by 1. Each text file gets its own column. I'm a little stuck on this part. I have bits of the code already, but I need to fill in the blanks. Here is what I have so far:

from tkinter.filedialog import askdirectory
import glob

import os 
import pandas as pd


def main():
    df = pd.DataFrame(columns =["TEXT FILE NAMES HERE..."])
    data_directory = askdirectory(initialdir = "/School_Files/CISC_121/Assignments/Assignment3/Data_Files")
    stopwords = open(os.getcwd() + "/" + "StopWords.txt")



    text_files = glob.glob(data_directory + "/" + "*.txt")



    for f in text_files:
        infile = open(f, "r", encoding = "UTF-8")
        #now read the file and do all the word-counting etc...
        lines = infile.readlines()
        for line in lines:
            x = 0
            words = line.split()
            while (x < len(words)):
                """
                Check if the word is in the stopwords
                If it isn't, then add the word into a row in a dataframe, for the first occurence, then
                increment the value by 1
                Have a column for each book 
                """
                for line in infile:
                    if word in line:
                        found = True
                        word += 1
                    else:
                        found = False

                x = x + 1

main()

If anyone can help me finish this section I'd really appreciate it. Please show the change in code. Thanks in advance!

CodePudding user response：

I see that you just want to count the occurrences of each word. For this you could use a dictionary instead of a DataFrame.

For the stopwords, read them into a list.

Try the code below.

stopwords = []
count_dictionary = {}

with open(os.getcwd() + "/" + "StopWords.txt") as f:
    stopwords = f.read().splitlines()

#your code

while (x < len(words)):
    word = words[x]  # current word on this line
    if word not in stopwords:
        if word in count_dictionary:
            count_dictionary[word] += 1
        else:
            count_dictionary[word] = 1
    x = x + 1  # move on to the next word
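
If it helps, here is a minimal, self-contained sketch of the same idea that keeps one dictionary per file, so each file can later become its own column. The names `count_words`, `count_per_file`, `sample_files`, and `sample_stopwords` are made up for illustration, and lowercasing every word is my assumption, not something the question asks for. It uses `collections.Counter` from the standard library in place of the manual if/else.

```python
from collections import Counter


def count_words(lines, stopwords):
    """Count each word in an iterable of text lines, skipping stopwords."""
    stop = {w.lower() for w in stopwords}  # a set gives fast membership tests
    counts = Counter()
    for line in lines:
        for word in line.split():
            word = word.lower()  # assumption: counting is case-insensitive
            if word not in stop:
                counts[word] += 1
    return counts


def count_per_file(files, stopwords):
    """Map each file name in `files` (name -> lines) to its word counts."""
    return {name: count_words(lines, stopwords) for name, lines in files.items()}


# Hypothetical sample data standing in for the real text files.
sample_files = {
    "book1.txt": ["The cat sat on the mat", "the cat ran"],
    "book2.txt": ["A dog and a cat"],
}
sample_stopwords = ["the", "a", "on", "and"]

table = count_per_file(sample_files, sample_stopwords)
print(table["book1.txt"]["cat"])  # 2
```

With real files you would pass the lines read from each path instead of the sample lists. If you still want a DataFrame, passing a dict of dicts like `table` to `pd.DataFrame` should give one column per file and one row per word, and `fillna(0)` can replace the missing entries for words that a file never uses.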