For my master's thesis I downloaded a ton of finance-related files. My objective is to find a specific phrase ("chapter 11") in order to flag all companies that have gone through the debt restructuring process. The problem is that I have more than 1.2 million small files, which makes the search really inefficient. So far I have written very basic code and reached a speed of about 1000 documents every 40-50 seconds. I was wondering whether there are specific libraries or methods (or even other programming languages) that would let me search faster. This is the function I'm using so far:
def get_items(m):
    word = "chapter 11"
    # read the whole file and do a case-insensitive substring check
    with open(m, encoding='utf8') as f:
        document = f.read()
    return word in document.lower()

# apply the function to the list of filenames:
l_v1 = list(map(get_items, filenames))
The size of the files varies between 5 KB and 4000 KB.
CodePudding user response:
Try the Unix tool, grep.
If the files are few, you can do:
grep -i "chapter 11" file1 file2 ...
Or,
grep -i "chapter 11" file*.txt
If there are many files, you can combine grep with find:
find . -type f | xargs grep -il "chapter 11"
The -l flag makes grep print only the names of the matching files, which is what you need for flagging companies. If any filenames contain spaces, use find . -type f -print0 | xargs -0 grep -il "chapter 11" instead, so the names are passed through intact.
Another powerful tool is ack (written in Perl) -- see https://beyondgrep.com/.
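Either way, a compiled tool like grep will usually beat a Python loop at this task by a wide margin: it is written in C, uses optimized string-search algorithms, and with -l it can stop reading a file at the first match instead of scanning the whole thing.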
CodePudding user response:
Well, you could use threading to split the filenames list into two or more smaller lists and search them simultaneously. Here's an example:
import threading

word = "chapter 11"
results = []  # filenames that contain the phrase

def get_items(names):
    # scan one chunk of filenames for the phrase
    for m in names:
        with open(m, encoding='utf8') as f:
            if word in f.read().lower():
                results.append(m)

# split the filename list in half and search both halves concurrently
# (filenames is the list of paths from the question)
half = len(filenames) // 2
x = threading.Thread(target=get_items, args=(filenames[:half],))
y = threading.Thread(target=get_items, args=(filenames[half:],))
x.start()
y.start()
x.join()
y.join()
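One caveat: in CPython the GIL means threads mainly overlap the file I/O, not the lowercasing and substring search itself. If the scan turns out to be CPU-bound, a process pool may scale better. A minimal sketch, assuming the same filenames list as above:
from multiprocessing import Pool

word = "chapter 11"

def contains_word(m):
    # errors='ignore' skips undecodable bytes instead of raising
    with open(m, encoding='utf8', errors='ignore') as f:
        return word in f.read().lower()

if __name__ == '__main__':
    # filenames is assumed to be defined as in the question
    # chunksize batches filenames per worker to cut IPC overhead
    with Pool() as pool:
        flags = pool.map(contains_word, filenames, chunksize=256)
    matches = [m for m, hit in zip(filenames, flags) if hit]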
CodePudding user response:
Here's a slightly different approach where we use multithreading to build a list of the filenames that contain the string 'chapter 11':
from concurrent.futures import ThreadPoolExecutor
filenames = [] # list of filenames
results = [] # list of filenames containing 'chapter 11'
word = 'chapter 11' # lowercase
def process(filename):
try:
with open(filename, encoding='utf-8') as infile:
if word in infile.read().lower():
results.append(filename)
except Exception:
pass
with ThreadPoolExecutor() as executor:
executor.map(process, filenames)
print(results)
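A note on thread safety: appending to the shared results list is fine here because in CPython list.append is effectively atomic under the GIL, so no explicit lock is needed.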
EDIT:
OP has said that all files to be processed are in a single directory/folder. In that case, rather than building a list of filenames, one could do this:
from concurrent.futures import ThreadPoolExecutor
from os.path import join
from os import listdir
import re
results = [] # list of filenames containing 'chapter 11'
cp = re.compile('chapter 11', re.IGNORECASE)
DIR = '' # directory containing files to be processed
def process(filename):
try:
with open(join(DIR, filename), encoding='utf-8') as infile:
if cp.search(infile.read()):
results.append(filename)
except Exception:
pass
with ThreadPoolExecutor() as executor:
executor.map(process, listdir(DIR))
print(results)
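As a side note, executor.map returns each call's result in order, so the shared results list can be avoided altogether by having process return the filename on a match (and None otherwise) and filtering afterwards. A sketch of that variant, under the same assumptions:
from concurrent.futures import ThreadPoolExecutor
from os.path import join
from os import listdir
import re

cp = re.compile('chapter 11', re.IGNORECASE)
DIR = ''  # directory containing files to be processed

def process(filename):
    # return the filename on a match, None otherwise
    try:
        with open(join(DIR, filename), encoding='utf-8') as infile:
            if cp.search(infile.read()):
                return filename
    except Exception:
        return None

with ThreadPoolExecutor() as executor:
    results = [name for name in executor.map(process, listdir(DIR)) if name]
print(results)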
This change also incorporates the idea of using a regular expression to search for the pattern, which may or may not be more efficient than using the in operator.
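If in doubt, the two approaches are easy to time against a representative file with timeit ('sample.txt' below is a hypothetical placeholder):
import re
import timeit

with open('sample.txt', encoding='utf-8') as f:  # hypothetical sample file
    text = f.read()

cp = re.compile('chapter 11', re.IGNORECASE)

# time 100 runs of each approach on the same text
print(timeit.timeit(lambda: 'chapter 11' in text.lower(), number=100))
print(timeit.timeit(lambda: cp.search(text), number=100))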