Iterate through a folder which contains 5 more folders each with 500 text files to match words

Time:11-26

I have a folder that contains 5 folders, each with around 450-550 text files. Each text file has around 1-12 sentences of varying length, separated by a tab, like this:

i love burgers 
i want to eat a burger 
etc

I want to write code that asks the user for a search term, then goes into each folder, opens and reads each text file, and counts how many times the search term appears. It should then move on to the next folder and repeat until it has gone through every folder and every text file.

So the output should be something like this:

input search term: good 
the search term appears this many times __ in the following files
file name 001.txt  
file name 002.txt  
file name 003.txt  

Here is some of the code I have so far:

from pathlib import Path
import os
from os.path import isdir, isfile
import nltk

search_word = input("Please enter the word you want to search for: ")
punctuation = "!,:;-_'.?"

location = Path(r'the folder')

os.chdir(location)
print(Path.cwd())

fileslist = os.listdir(Path.cwd())
print(fileslist)

for file in fileslist:
    if isdir(file):
        os.chdir(file)
        print(Path.cwd())

        content = os.listdir(Path.cwd())
        
        for document in content:      
            with open(document,'r') as infile:
                data = []
                for line in infile:
                    data += [line.strip(punctuation)]
                print(data)
        
        os.chdir('../')
        print(Path.cwd())
    else:
        os.chdir(location)

I have tried watching some YouTube videos on how to do it, but I haven't been able to figure it out.

CodePudding user response:

If you just want to count the number of occurrences of a word, for example, in a set of .txt files, something like this will do it:

from pathlib import Path

word = input('Enter the word you want to search for: ')
path = Path('/some/folder')
counter = {}

for file in path.rglob('*.txt'):
    if file.is_file():
        counter[file] = file.read_text().count(word)

print(
    f'The search term "{word}" appears {sum(counter.values())}',
    'times in the following files:'
)

for file in [_ for _ in counter if counter[_]]:
    print(f'{file}: {counter[file]} times')
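
One caveat with str.count(): it matches substrings, so searching for "good" also counts "goodbye", and it is case-sensitive. If whole words are what you want, a regular expression with \b word boundaries is one option. The following is only a sketch: count_word and count_in_tree are names made up for illustration, and UTF-8 is an assumption about how the files are encoded.

```python
import re
from pathlib import Path


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # re.escape keeps characters like '.' in the search term literal;
    # \b only matches at word boundaries, so "good" skips "goodbye"
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return len(pattern.findall(text))


def count_in_tree(root: Path, word: str) -> dict:
    """Map each .txt file under `root` to its count, leaving out files with none."""
    counts = {}
    for file in sorted(root.rglob("*.txt")):
        if file.is_file():
            # encoding="utf-8" is an assumption about the data
            n = count_word(file.read_text(encoding="utf-8"), word)
            if n:
                counts[file] = n
    return counts
```

Calling count_in_tree(Path('/some/folder'), word) returns a dict that can be printed the same way as counter above. Because tabs and punctuation are not word characters, the tab-separated sentences need no extra stripping.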

CodePudding user response:

This would be a perfect use case for the walk() function in the os module.

Given a start directory, os.walk() recursively iterates through the directory structure and yields a tuple of (current_directory, directory_names, file_names) for each directory it visits.

You can then iterate through the file names, check which ones end with '.txt', and open each of those files. A generator expression checks whether each line contains the search term, and sum() adds up the results:

import os
import os.path
    
STARTDIR=input("directory: ")
SEARCH=input("search term: ")
total = 0

for dirname, dirlist, filelist in os.walk(STARTDIR):
    for filename in filelist:
        if filename.endswith(".txt"):
            # get full filename to use with open() function
            fullname = os.path.join(dirname, filename)

            # use generator expression to iterate over the lines of the
            # opened file and sum up the results (True == 1 for sum())
            count = sum(SEARCH in line for line in open(fullname))

            # if non zero count then print the filename and count
            if count:
                print(f"{fullname} contains {count} lines with {SEARCH}")

            total += count

print(f"{SEARCH} occurred a total of {total} times")

SAMPLE OUTPUT:

directory: c:\downloads\test    
search term: hello

c:\downloads\test\a\aa\info.txt contains 1 lines with hello
c:\downloads\test\a\aa\log.txt contains 1 lines with hello
c:\downloads\test\a\bb\greeting.txt contains 1 lines with hello
c:\downloads\test\b\cc\control.txt contains 3 lines with hello
c:\downloads\test\b\cc\dumb.txt contains 1 lines with hello
c:\downloads\test\b\cc\info.txt contains 4 lines with hello
c:\downloads\test\c\aa\dog.txt contains 2 lines with hello
c:\downloads\test\c\dd\good.txt contains 1 lines with hello
hello occurred a total of 14 times
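
Note that the script above counts matching lines, so a line that contains the search term twice still adds only 1. If you need the number of occurrences instead, one possible variation is to use str.count() on each line. This is a sketch, not a drop-in replacement: count_occurrences is a hypothetical helper name, and the utf-8 encoding with errors="replace" is an assumption about the files.

```python
import os


def count_occurrences(start_dir: str, search: str) -> dict:
    """Return {full_path: occurrence_count} for .txt files under start_dir."""
    results = {}
    for dirname, _dirlist, filelist in os.walk(start_dir):
        for filename in filelist:
            if not filename.endswith(".txt"):
                continue
            fullname = os.path.join(dirname, filename)
            # the with-block closes each file as soon as it has been read;
            # errors="replace" (an assumption) keeps one badly encoded file
            # from aborting the whole run
            with open(fullname, encoding="utf-8", errors="replace") as infile:
                count = sum(line.count(search) for line in infile)
            if count:
                results[fullname] = count
    return results
```

The grand total is then sum(results.values()), and the per-file counts can be printed by looping over results.items().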