Fastest way to search many files in many directories?

Time:10-19

I have a list of file names:

files = (
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
)

and a set of directories that might contain these files (each path may lead to a nested directory structure):

search_paths = (
   "C:/Users/Foo/Desktop/thisfolder/",
   "F:/Documents/mylibrary/",
   "F:/Folder/mylibrary/",
   "E:/Otherfolder/foolibrary/",
)

What is the fastest function we could write to recover the full paths of these files?

CodePudding user response:

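Multithreading is a good fit here, since each search path can be walked independently of the others.

First, convert the tuple of file names into a set, so that each membership test takes constant time on average instead of scanning the whole tuple.

Then it's as simple as: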

from concurrent.futures import ThreadPoolExecutor
import os

FILES = {
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt"
}

SEARCH_PATHS = [
   "C:/Users/Foo/Desktop/thisfolder/",
   "F:/Documents/mylibrary/",
   "F:/Folder/mylibrary/",
   "E:/Otherfolder/foolibrary/"
]

def process_directory(directory):
    output = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file in FILES:
                output.append(os.path.join(root, file))
    return output

result = []

with ThreadPoolExecutor() as executor:
    for rv in executor.map(process_directory, SEARCH_PATHS):
        result.extend(rv)

print(result)

This way, each search path is walked in its own concurrent thread. Because os.walk() is largely I/O-bound, multithreading is appropriate.
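If you want to know which paths belong to which file name, the same approach can return a dictionary instead of a flat list. This is only a minimal sketch: the names walk_for_names and find_files are hypothetical, and it assumes the search paths do not overlap (otherwise the same file would be reported more than once).

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def walk_for_names(directory, names):
    """Return (name, full_path) pairs for wanted files found under directory."""
    hits = []
    for root, _, files in os.walk(directory):
        for name in files:
            if name in names:
                hits.append((name, os.path.join(root, name)))
    return hits


def find_files(names, search_paths):
    """Walk each search path in a worker thread and group the hits by file name."""
    wanted = frozenset(names)  # constant-time membership tests on average
    found = defaultdict(list)
    with ThreadPoolExecutor() as executor:
        # executor.map yields results in the same order as search_paths.
        for hits in executor.map(lambda d: walk_for_names(d, wanted), search_paths):
            for name, path in hits:
                found[name].append(path)
    return dict(found)
```

Calling find_files(FILES, SEARCH_PATHS) would then give something like {"foo.txt": [...], "mydata.txt": [...]}, with names that were not found simply absent from the result.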

CodePudding user response:

As per the comments, you can use os.walk().
Unless you are using memoization, I'm not sure there is a faster way than this.

Example code:

import os

# A set gives constant-time membership tests on average in the loop below.
file_list = {
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
}

search_paths = (
   "C:/Users/Foo/Desktop/thisfolder/",
   "F:/Documents/mylibrary/",
   "F:/Folder/mylibrary/",
   "E:/Otherfolder/foolibrary/",
)

# search for files inside search_paths using os.walk()
for path in search_paths:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file in file_list:
                print(os.path.join(root, file))
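One way to read the memoization remark above: if you run many lookups against the same directories, you could walk them once, remember where every file lives, and answer later queries from that cache. The following is a rough sketch under that assumption; build_index and lookup are hypothetical names, and the index goes stale as soon as files are added, moved, or deleted.

```python
import os
from collections import defaultdict


def build_index(search_paths):
    """Walk every search path once and map each file name to its full paths."""
    index = defaultdict(list)
    for path in search_paths:
        for root, _, files in os.walk(path):
            for name in files:
                index[name].append(os.path.join(root, name))
    return dict(index)


def lookup(index, names):
    """Return {name: [paths]} for the requested names that are in the index."""
    return {name: index[name] for name in names if name in index}
```

Building the index costs one full walk, just like the loop above, so this only pays off when the same index is reused for several lookups.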