Search a very large directory for a file containing text in its name


I have a network share that contains around 300,000 files and is constantly changing (files added and removed). I want to search the directory for specific text to find certain files within it. I have trimmed my method down about as far as I can, but it still takes over 6 minutes to complete; I could probably do the search manually in about the same time, depending on the number of strings I'm searching for. I want to multithread or multiprocess it, but I'm uncertain how this can be done on a single call, i.e.,

for filename in os.scandir(sourcedir):

Can anyone please help me figure this out?

import os

def scan(sourcedir: str, oset: set[str] | str) -> set[str]:
    found = set()
    for filename in os.scandir(sourcedir):
        # record the name on the first search string that matches it
        for ordr in oset:
            if ordr in filename.name:
                print(filename.name)
                found.add(filename.name)
                break
    return found

RESULTS FROM A TYPICAL CALL: 516 function calls in 395.033 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     6    0.000    0.000    0.003    0.000  :39(isdir)
     6    0.000    0.000    1.346    0.224  :94(samefile)
    12    0.000    0.000    0.001    0.000  :103(join)
    30    0.000    0.000    0.000    0.000  :150(splitdrive)
     6    0.000    0.000    0.000    0.000  :206(split)
     6    0.000    0.000    0.000    0.000  :240(basename)
     6    0.000    0.000    0.000    0.000  :35(_get_bothseps)
     1    0.000    0.000    0.000    0.000  :545(normpath)
     1    0.000    0.000    0.000    0.000  :577(abspath)
     1    0.000    0.000  395.033  395.033  :1()
     1    0.000    0.000  395.033  395.033  CopyOrders.py:31(main)
     1  389.826  389.826  389.976  389.976  CopyOrders.py:67(scan)
     1    0.000    0.000    5.056    5.056  CopyOrders.py:88(copy)
     1    0.000    0.000    0.000    0.000  getopt.py:56(getopt)
     6    0.000    0.000    0.001    0.000  shutil.py:170(_copyfileobj_readinto)
     6    0.000    0.000    1.346    0.224  shutil.py:202(_samefile)
    18    0.000    0.000    1.493    0.083  shutil.py:220(_stat)
     6    0.001    0.000    4.295    0.716  shutil.py:226(copyfile)
     6    0.000    0.000    0.756    0.126  shutil.py:290(copymode)
     6    0.000    0.000    5.054    0.842  shutil.py:405(copy)
     6    0.000    0.000    0.000    0.000  {built-in method _stat.S_IMODE}
     6    0.000    0.000    0.000    0.000  {built-in method _stat.S_ISDIR}
     6    0.000    0.000    0.000    0.000  {built-in method _stat.S_ISFIFO}
     1    0.000    0.000  395.033  395.033  {built-in method builtins.exec}
     6    0.000    0.000    0.000    0.000  {built-in method builtins.hasattr}
    73    0.000    0.000    0.000    0.000  {built-in method builtins.isinstance}
    38    0.000    0.000    0.000    0.000  {built-in method builtins.len}
     6    0.000    0.000    0.000    0.000  {built-in method builtins.min}
    14    0.003    0.000    0.003    0.000  {built-in method builtins.print}
    12    2.180    0.182    2.180    0.182  {built-in method io.open}
     1    0.000    0.000    0.000    0.000  {built-in method nt._getfullpathname}
     1    0.000    0.000    0.000    0.000  {built-in method nt._path_normpath}
     6    0.012    0.002    0.012    0.002  {built-in method nt.chmod}
    49    0.000    0.000    0.000    0.000  {built-in method nt.fspath}
     1    0.149    0.149    0.149    0.149  {built-in method nt.scandir}
    36    2.841    0.079    2.841    0.079  {built-in method nt.stat}
    12    0.000    0.000    0.000    0.000  {built-in method sys.audit}
    12    0.019    0.002    0.019    0.002  {method '__exit__' of '_io._IOBase' objects}
     6    0.000    0.000    0.000    0.000  {method '__exit__' of 'memoryview' objects}
     6    0.000    0.000    0.000    0.000  {method 'add' of 'set' objects}
     1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
    36    0.000    0.000    0.000    0.000  {method 'find' of 'str' objects}
    12    0.001    0.000    0.001    0.000  {method 'readinto' of '_io.BufferedReader' objects}
    30    0.000    0.000    0.000    0.000  {method 'replace' of 'str' objects}
     6    0.000    0.000    0.000    0.000  {method 'rstrip' of 'str' objects}
     6    0.000    0.000    0.000    0.000  {method 'write' of '_io.BufferedWriter' objects}

CodePudding user response:

You could try glob.

I don't have a directory with 300,000 files to test it on, but I'm assuming it would be pretty quick (a few seconds).

import glob
import os

sourcedir = r'path\to\your\files'
oset = ['some', 'list', 'not', 'shown', 'in', 'your', 'code']

found = []
for ordr in oset:
    # Get all files in "sourcedir" with "ordr" somewhere in the filename
    files = glob.glob(os.path.join(sourcedir, f'*{ordr}*'))
    found.extend(files)

print('\n'.join(found))
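One caveat: if any of the search strings contain glob metacharacters (*, ? or [), they will be treated as wildcards rather than literal text. A minimal sketch using glob.escape to neutralize them, continuing from the snippet above (the term value here is made up):

import glob
import os

term = 'order[1]'  # hypothetical search string containing a metacharacter
# glob.escape makes *, ? and [ inside the term match literally; only the
# wildcards added around it stay active
files = glob.glob(os.path.join(sourcedir, f'*{glob.escape(term)}*'))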

CodePudding user response:

Since you're only interested in the file names and not any of the other file attributes, you shouldn't pay os.scandir's overhead of building a DirEntry object for every entry. Use os.listdir instead to retrieve just a list of file names.

Secondly, you can search for multiple substrings more efficiently with a single regex alternation pattern, since the re module's matching loop is implemented in C.

import os
import re

def scan(sourcedir: str, oset: set[str]) -> set[str]:
    # One alternation pattern, with each search string escaped so any
    # regex metacharacters in it are matched literally
    regex = re.compile('|'.join(map(re.escape, oset)))
    return {name for name in os.listdir(sourcedir) if regex.search(name)}
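If a single process is still too slow, the multiprocessing idea from the question can be applied to the matching step by splitting the name list across workers. A minimal sketch, assuming the cost is dominated by matching rather than by the serial directory read over the network; match_chunk and scan_parallel are illustrative names, not a tested drop-in:

import os
import re
from concurrent.futures import ProcessPoolExecutor

def match_chunk(names: list[str], pattern: str) -> set[str]:
    # Each worker compiles the pattern once and scans its slice of names
    regex = re.compile(pattern)
    return {name for name in names if regex.search(name)}

def scan_parallel(sourcedir: str, oset: set[str], workers: int = 4) -> set[str]:
    pattern = '|'.join(map(re.escape, oset))
    names = os.listdir(sourcedir)  # the directory read itself stays serial
    chunks = [names[i::workers] for i in range(workers)]
    found = set()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(match_chunk, chunks, [pattern] * workers):
            found |= result
    return found

On Windows the call to scan_parallel must sit under an if __name__ == '__main__': guard, and if most of the 6 minutes is network I/O in the directory listing itself, adding worker processes won't help.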