How do I make each line in a text file its own dictionary to sort through in Python?

Currently, I have

import re 
import string

input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')
stopwords_list = []

for line in stopwords_file.readlines():
  stopwords_list.extend(line.split())

stopwords_set = set(stopwords_list)

word_count = {}
for line in input_file.readlines():
    words = line.strip()
    words = words.translate(str.maketrans('','', string.punctuation))
    words = re.findall(r'\w+', words)
    for word in words: 
      if word.lower() in stopwords_set:
        continue
      word = word.lower()
      if not word in word_count: 
        word_count[word] = 1
      else: 
        word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
  print (word, word_count[word])

It parses a text file, removes stopwords, and prints the number of times each remaining word appears in the file it is reading from.

The problem is that the text file contains not one document, but five.

The text in the document looks something like this:

1 
The cat in the hat was on the mat

2 
The rat on the mat sat

3
The bat was fat and named Pat

Each "document" is a line of text preceded by its document ID number on a separate line.

In Python, I want to go through documents 1, 2, and 3 and count how many times a word appears in each individual document, as well as the total number of times it appears in the whole text file, which my code currently does.

For example: "mat" appears 2 times in the text file, once in document 1 and once in document 2. Ideally the output would be less wordy.

CodePudding user response：

Give this a try:

import re
import string

def count_words(file_name):
    word_count = {}
    with open(file_name, 'r') as input_file:
        for line in input_file:
            if line.startswith("document"):
                doc_id = line.split()[0]
                words = line.strip().split()[1:]
                for word in words:
                    word = word.translate(str.maketrans('','', string.punctuation)).lower()
                    if word in word_count:
                        word_count[word][doc_id] = word_count[word].get(doc_id, 0) + 1
                    else:
                        word_count[word] = {doc_id: 1}
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")

CodePudding user response：

You have deleted your previous similar question and with it my answer, so I'm not sure if it's a good idea to answer again. I'll give a slightly different answer, without groupby, although I think it was fine.

You could try:

import re
from collections import Counter
from string import punctuation

with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))
translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")
with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

I would build the translation table only once, beforehand, rather than again for each line.
The regex that identifies a new document, (\d+)\s*$, looks for digits at the beginning of a line and nothing else, except maybe some whitespace, until the line break. You have to adjust it if the identifier follows a different logic.
word_count records each occurrence of a word in a list, using the number of the current document.
word_count_overall just takes the length of the respective lists to get the overall count of a word.
word_count_docs applies a Counter to the lists to get the counts per document for each word.
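
To see what the two resulting dictionaries hold, here is a hedged sketch that wraps the same logic in a function and runs it on the sample from the question. The function name `index_documents`, the inline stopword set, and reading from a list of lines rather than from documents.txt are assumptions made for illustration only.

```python
import re
from collections import Counter
from string import punctuation

TRANSLATION = str.maketrans("", "", punctuation)  # built once, reused for every line
RE_NEW_DOC = re.compile(r"(\d+)\s*$")  # a line holding only digits and optional whitespace


def index_documents(lines, stopwords):
    """Return (overall counts, per-document Counters), both keyed by word."""
    word_docs, doc_no = {}, 0
    for line in lines:
        match = RE_NEW_DOC.match(line)
        if match:
            doc_no = int(match[1])  # switch to the new document number
            continue
        for word in re.findall(r"\w+", line.translate(TRANSLATION)):
            word = word.casefold()
            if word not in stopwords:
                word_docs.setdefault(word, []).append(doc_no)
    overall = {word: len(docs) for word, docs in word_docs.items()}
    per_doc = {word: Counter(docs) for word, docs in word_docs.items()}
    return overall, per_doc


# Sample lines mirroring the layout shown in the question.
sample = [
    "1 \n", "The cat in the hat was on the mat\n", "\n",
    "2 \n", "The rat on the mat sat\n", "\n",
    "3\n", "The bat was fat and named Pat\n",
]
stopwords = {"the", "in", "on", "was", "and"}  # hypothetical stand-in for stopwords_en.txt
overall, per_doc = index_documents(sample, stopwords)
for word in sorted(overall):
    docs = ", ".join(f"{doc} ({n})" for doc, n in sorted(per_doc[word].items()))
    print(f"{word}: {overall[word]} total; by document: {docs}")
```

For this sample, the line printed for "mat" reads `mat: 2 total; by document: 1 (1), 2 (1)`, which is the compact per-document summary the question asks for.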