For my master's thesis I downloaded a ton of finance-related files. My objective is to find a specific phrase ("chapter 11") in order to flag all companies that have gone through the debt restructuring process. The problem is that I have more than 1.2 million small files, which makes the search really inefficient. So far I have written very basic code and reached a speed of about 1000 documents every 40-50 seconds. I was wondering whether there are specific libraries or methods (or even other programming languages) that would let me search faster. This is the function I'm using so far:
def get_items(m):
    word = "chapter 11"
    # read the whole file and do a case-insensitive substring check
    with open(m, encoding='utf8') as f:
        document = f.read()
    return word in document.lower()

# apply the function to the list of filenames:
l_v1 = list(map(get_items, filenames))
The size of the files varies between 5 KB and 4000 KB.
CodePudding user response:
Try the Unix tool, grep.
If the files are few, you can do:
grep -i "chapter 11" file1 file2 ...
Or,
grep -i "chapter 11" file*.txt
If there are many files, you can combine grep with find:
find . -type f | xargs grep -il "chapter 11"
The -l flag makes grep print only the names of the matching files, which is what you need for flagging companies. If any filenames contain spaces, use find . -type f -print0 | xargs -0 grep -il "chapter 11" instead, so the names are passed through intact.
Another powerful tool is ack (written in Perl) -- see https://beyondgrep.com/.
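Either way, a compiled tool like grep will usually beat a Python loop at this task by a wide margin: it is written in C, uses optimized string-search algorithms, and with -l it can stop reading a file at the first match instead of scanning the whole thing.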
CodePudding user response:
Well, you could use threading to split the filenames list into two or more smaller lists and search them simultaneously. Here's an example:
import threading

word = "chapter 11"
results = []  # filenames that contain the phrase

def get_items(names):
    # scan one chunk of filenames for the phrase
    for m in names:
        with open(m, encoding='utf8') as f:
            if word in f.read().lower():
                results.append(m)

# split the filename list in half and search both halves concurrently
# (filenames is the list of paths from the question)
half = len(filenames) // 2
x = threading.Thread(target=get_items, args=(filenames[:half],))
y = threading.Thread(target=get_items, args=(filenames[half:],))
x.start()
y.start()
x.join()
y.join()
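One caveat: in CPython the GIL means threads mainly overlap the file I/O, not the lowercasing and substring search itself. If the scan turns out to be CPU-bound, a process pool may scale better. A minimal sketch, assuming the same filenames list as above:
from multiprocessing import Pool

word = "chapter 11"

def contains_word(m):
    # errors='ignore' skips undecodable bytes instead of raising
    with open(m, encoding='utf8', errors='ignore') as f:
        return word in f.read().lower()

if __name__ == '__main__':
    # filenames is assumed to be defined as in the question
    # chunksize batches filenames per worker to cut IPC overhead
    with Pool() as pool:
        flags = pool.map(contains_word, filenames, chunksize=256)
    matches = [m for m, hit in zip(filenames, flags) if hit]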
CodePudding user response:
Here's a slightly different approach where we use multithreading to build a list of the filenames that contain the string 'chapter 11':
from concurrent.futures import ThreadPoolExecutor
filenames = [] # list of filenames
results = [] # list of filenames containing 'chapter 11'
word = 'chapter 11' # lowercase
def process(filename):
try:
with open(filename, encoding='utf-8') as infile:
if word in infile.read().lower():
results.append(filename)
except Exception:
pass
with ThreadPoolExecutor() as executor:
executor.map(process, filenames)
print(results)
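A note on thread safety: appending to the shared results list is fine here because in CPython list.append is effectively atomic under the GIL, so no explicit lock is needed.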
EDIT:
OP has said that all files to be processed are in a single directory/folder. In that case, rather than building a list of filenames, one could do this:
from concurrent.futures import ThreadPoolExecutor
from os.path import join
from os import listdir
import re
results = [] # list of filenames containing 'chapter 11'
cp = re.compile('chapter 11', re.IGNORECASE)
DIR = '' # directory containing files to be processed
def process(filename):
try:
with open(join(DIR, filename), encoding='utf-8') as infile:
if cp.search(infile.read()):
results.append(filename)
except Exception:
pass
with ThreadPoolExecutor() as executor:
executor.map(process, listdir(DIR))
print(results)
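As a side note, executor.map returns each call's result in order, so the shared results list can be avoided altogether by having process return the filename on a match (and None otherwise) and filtering afterwards. A sketch of that variant, under the same assumptions:
from concurrent.futures import ThreadPoolExecutor
from os.path import join
from os import listdir
import re

cp = re.compile('chapter 11', re.IGNORECASE)
DIR = ''  # directory containing files to be processed

def process(filename):
    # return the filename on a match, None otherwise
    try:
        with open(join(DIR, filename), encoding='utf-8') as infile:
            if cp.search(infile.read()):
                return filename
    except Exception:
        return None

with ThreadPoolExecutor() as executor:
    results = [name for name in executor.map(process, listdir(DIR)) if name]
print(results)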
This change also incorporates the idea of using a regular expression to search for the pattern, which may or may not be more efficient than using the in operator.
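If in doubt, the two approaches are easy to time against a representative file with timeit ('sample.txt' below is a hypothetical placeholder):
import re
import timeit

with open('sample.txt', encoding='utf-8') as f:  # hypothetical sample file
    text = f.read()

cp = re.compile('chapter 11', re.IGNORECASE)

# time 100 runs of each approach on the same text
print(timeit.timeit(lambda: 'chapter 11' in text.lower(), number=100))
print(timeit.timeit(lambda: cp.search(text), number=100))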