Currently, I have
import re
import string
input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')
stopwords_list = []
for line in stopwords_file.readlines():
stopwords_list.extend(line.split())
stopwords_set = set(stopwords_list)
word_count = {}
for line in input_file.readlines():
words = line.strip()
words = words.translate(str.maketrans('','', string.punctuation))
words = re.findall('\w ', line)
for word in words:
if word.lower() in stopwords_set:
continue
word = word.lower()
if not word in word_count:
word_count[word] = 1
else:
word_count[word] = word_count[word] 1
word_index = sorted(word_count.keys())
for word in word_index:
print (word, word_count[word])
What it does is parses through a txt file I have, removes stopwords, and outputs the number of times a word appears in the document it is reading from.
The problem is that the txt file is not one file, but five.
The text in the document looks something like this:
1
The cat in the hat was on the mat
2
The rat on the mat sat
3
The bat was fat and named Pat
Each "document" is a line preceded by the document ID number.
In Python, I want to find a way to go through 1, 2, and 3 and count how many times a word appears in an individual document, as well as the total amount of times a word appears in the whole text file - which my code currently does.
i.e Mat appears 2 times in the text document. It appears in Document 1 and Document 2 Ideally less wordy.
CodePudding user response:
Give this a try:
import re
import string
def count_words(file_name):
word_count = {}
with open(file_name, 'r') as input_file:
for line in input_file:
if line.startswith("document"):
doc_id = line.split()[0]
words = line.strip().split()[1:]
for word in words:
word = word.translate(str.maketrans('','', string.punctuation)).lower()
if word in word_count:
word_count[word][doc_id] = word_count[word].get(doc_id, 0) 1
else:
word_count[word] = {doc_id: 1}
return word_count
word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
print(f"{word} appears in: {doc_count}")
CodePudding user response:
You have deleted your previous similar question and with it my answer, so I'm not sure if it's a good idea to answer again. I'll give a slightly different answer, without groupby
, although I think it was fine.
You could try:
import re
from collections import Counter
from string import punctuation
with open("stopwords_en.txt", "r") as file:
stopwords = set().union(*(line.rstrip().split() for line in file))
translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d )\s*$")
with open("documents.txt", "r") as file:
word_count, doc_no = {}, 0
for line in file:
match = re_new_doc.match(line)
if match:
doc_no = int(match[1])
continue
line = line.translate(translation)
for word in re.findall(r"\w ", line):
word = word.casefold()
if word in stopwords:
continue
word_count.setdefault(word, []).append(doc_no)
word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}
- I would make the translation table only once, beforehand, not for each line again.
- The regex for the identification of a new document
(\d )\s*$"
looks for digits at the beginning of a line and nothing else, except maybe some whitespace, until the line break. You have to adjust it if the identifier follows a different logic. word_count
records each occurrence of a word in a list with the number of the current document.word_count_overall
just takes the length of the resp. lists to get the overall count of a word.word_count_docs
does apply aCounter
on the lists to get the counts per document for each word.