I'm writing a program that is supposed to compare text files by returning a list of all the words that appear in each file and the number of times each one appears. I have to ignore a list of words called stopwords, so they are not counted. For the first part, I need to check whether each word is in the stopwords. If it is, I don't count it. If it isn't, I add a new row for that word in a DataFrame (assuming it doesn't already exist there) and increment its frequency by 1. Each text file gets its own column. I'm a little stuck on this part. I have bits of the code already, but I need to fill in the blanks. Here is what I have so far:
from tkinter.filedialog import askdirectory
import glob
import os
import pandas as pd

def main():
    df = pd.DataFrame(columns =["TEXT FILE NAMES HERE..."])
    data_directory = askdirectory(initialdir = "/School_Files/CISC_121/Assignments/Assignment3/Data_Files")
    stopwords = open(os.getcwd() + "/" + "StopWords.txt")
    text_files = glob.glob(data_directory + "/" + "*.txt")
    for f in text_files:
        infile = open(f, "r", encoding = "UTF-8")
        #now read the file and do all the word-counting etc...
        lines = infile.readlines()
        for line in lines:
            x = 0
            words = line.split()
            while (x < len(words)):
                """
                Check if the word is in the stopwords
                If it isn't, then add the word into a row in a dataframe, for the first occurrence, then
                increment the value by 1
                Have a column for each book
                """
                for line in infile:
                    if word in line:
                        found = True
                        word += 1
                    else:
                        found = False
                x = x + 1

main()
If anyone can help me finish this section I'd really appreciate it. Please show the change in code. Thanks in advance!
CodePudding user response:
I see that you just want to count the occurrences of each word. For this you could use a dictionary instead of a DataFrame.
For the stopwords, read them into a list.
Try the code below.
stopwords = []
count_dictionary = {}
with open(os.getcwd() + "/" + "StopWords.txt") as f:
    stopwords = f.read().splitlines()
#your code
while (x < len(words)):
    word = words[x]
    if word not in stopwords:
        if word in count_dictionary:
            # word seen before: increment its count
            count_dictionary[word] += 1
        else:
            # first occurrence of this word
            count_dictionary[word] = 1
    x = x + 1
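To get one set of counts per text file, as the question asks, the same idea can be extended to a dictionary of dictionaries. The following is only a rough sketch under some assumptions: the helper names `count_words` and `count_files`, the sample file names, and the sample texts are all made up for illustration, and words are split on whitespace only (no lowercasing or punctuation stripping).

```python
from collections import Counter


def count_words(text, stopwords):
    """Count each whitespace-separated word in text, skipping stopwords."""
    counts = Counter()
    for word in text.split():
        if word not in stopwords:
            counts[word] += 1
    return counts


def count_files(texts_by_name, stopwords):
    """Build {file name: {word: count}} for several texts."""
    stopword_set = set(stopwords)  # set lookups are faster than list lookups
    return {name: dict(count_words(text, stopword_set))
            for name, text in texts_by_name.items()}


# Hypothetical sample data standing in for the real text files.
sample_texts = {
    "book1.txt": "the cat sat on the mat",
    "book2.txt": "the dog chased the cat",
}
sample_stopwords = ["the", "on"]

table = count_files(sample_texts, sample_stopwords)
print(table)
```

In the real program, each entry of `sample_texts` would come from reading one of the files found by `glob.glob`. If you still need a DataFrame with one row per word and one column per file, passing the result to `pd.DataFrame(table).fillna(0).astype(int)` should produce it, with 0 for words that do not appear in a given file.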