I have a list of file names:

files = (
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
)
and a set of directories in which the files might be located (each path may contain a nested directory structure):

search_paths = (
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
)
What is the most efficient function we could write to recover the full paths of these files?
CodePudding user response:
Multithreading would be ideal for this.
First, turn the tuple of filenames into a set so that membership checks are O(1).
Then it's as simple as...
from concurrent.futures import ThreadPoolExecutor
import os
FILES = {
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
}

SEARCH_PATHS = [
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
]
def process_directory(directory):
    """Walk a single directory tree and collect paths of files listed in FILES."""
    output = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file in FILES:  # O(1) membership test on the set
                output.append(os.path.join(root, file))
    return output

result = []
with ThreadPoolExecutor() as executor:
    # each search path is walked in its own worker thread
    for rv in executor.map(process_directory, SEARCH_PATHS):
        result.extend(rv)

print(result)
This way, each directory is examined in its own (concurrent) thread. Since os.walk() is I/O-bound, multithreading is a good fit here.
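If each file name is expected to appear at most once, the walk can also stop early instead of scanning the whole tree. The helper below, find_files_early_exit, is a hypothetical sketch of that idea (it is not part of the answer above) and would still be run once per search path:

import os

def find_files_early_exit(directory, wanted):
    """Walk directory and return matching full paths, stopping as soon as
    every name in wanted has been seen (assumes one match per name)."""
    remaining = set(wanted)   # local copy so the caller's collection is untouched
    found = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file in remaining:
                found.append(os.path.join(root, file))
                remaining.discard(file)
        if not remaining:     # everything located, stop walking this tree
            break
    return found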
CodePudding user response:
As mentioned in the comments, you can use os.walk().
Unless you cache results (memoization), I'm not sure there is a faster approach than this.
Example code:
import os
file_list = (
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
)

search_paths = (
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
)
# search for files inside search_paths using os.walk()
for path in search_paths:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file in file_list:
                print(os.path.join(root, file))
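If you would rather collect the results than print them, a small wrapper along these lines could work; locate_files and its return format are assumptions for illustration, not part of the original answer:

import os
from collections import defaultdict

def locate_files(file_list, search_paths):
    """Return a dict mapping each wanted file name to the full paths found."""
    wanted = set(file_list)              # set gives O(1) membership tests
    found = defaultdict(list)
    for path in search_paths:
        for root, _, files in os.walk(path):
            for file in files:
                if file in wanted:
                    found[file].append(os.path.join(root, file))
    return dict(found)

result = locate_files(file_list, search_paths)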