I have a list of file names:

files = (
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
)
and a set of directories in which the files might be located (each path may contain a nested directory structure):

search_paths = (
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
)
What is the most efficient function we could write to recover the full paths of these files?
CodePudding user response:
Multithreading would be ideal for this.
First, turn the tuple of filenames into a set so that membership checks are O(1).
Then it's as simple as...
from concurrent.futures import ThreadPoolExecutor
import os
FILES = {
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
}

SEARCH_PATHS = [
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
]
def process_directory(directory):
    """Walk a single directory tree and collect paths of files listed in FILES."""
    output = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file in FILES:  # O(1) membership test on the set
                output.append(os.path.join(root, file))
    return output

result = []
with ThreadPoolExecutor() as executor:
    # each search path is walked in its own worker thread
    for rv in executor.map(process_directory, SEARCH_PATHS):
        result.extend(rv)

print(result)
This way, each directory is examined in its own (concurrent) thread. Since os.walk() is I/O-bound, multithreading is a good fit here.
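If each file name is expected to appear at most once, the walk can also stop early instead of scanning the whole tree. The helper below, find_files_early_exit, is a hypothetical sketch of that idea (it is not part of the answer above) and would still be run once per search path:

import os

def find_files_early_exit(directory, wanted):
    """Walk directory and return matching full paths, stopping as soon as
    every name in wanted has been seen (assumes one match per name)."""
    remaining = set(wanted)   # local copy so the caller's collection is untouched
    found = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file in remaining:
                found.append(os.path.join(root, file))
                remaining.discard(file)
        if not remaining:     # everything located, stop walking this tree
            break
    return found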
CodePudding user response:
As mentioned in the comments, you can use os.walk().
Unless you cache results (memoization), I'm not sure there is a faster approach than this.
Example code:
import os
file_list = (
    "myinstruction.txt",
    "myinfo.txt",
    "mydata.txt",
    "myclients.txt",
    "foo.txt",
)

search_paths = (
    "C:/Users/Foo/Desktop/thisfolder/",
    "F:/Documents/mylibrary/",
    "F:/Folder/mylibrary/",
    "E:/Otherfolder/foolibrary/",
)
# search for files inside search_paths using os.walk()
for path in search_paths:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file in file_list:
                print(os.path.join(root, file))
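If you would rather collect the results than print them, a small wrapper along these lines could work; locate_files and its return format are assumptions for illustration, not part of the original answer:

import os
from collections import defaultdict

def locate_files(file_list, search_paths):
    """Return a dict mapping each wanted file name to the full paths found."""
    wanted = set(file_list)              # set gives O(1) membership tests
    found = defaultdict(list)
    for path in search_paths:
        for root, _, files in os.walk(path):
            for file in files:
                if file in wanted:
                    found[file].append(os.path.join(root, file))
    return dict(found)

result = locate_files(file_list, search_paths)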