I am trying to list files from Azure Data Lake Storage that match a pattern, using os.walk. It is too slow and the runtime is not acceptable to the business. Is there a faster way to do this?
Code snippet below:
import os
from fnmatch import fnmatch

# pattern holds something like '201707' (YYYYMM) as files are dated.
pattern = "*{0}*.*".format(batch_no)
print(pattern)
files_list = []
# os.walk to get file paths
for root in root_list:
    for path, subdirs, files in os.walk(root):
        for name in files:
            if fnmatch(name.upper(), pattern.upper()):
                files_list.append(str(batch_no) + path.replace("dbfs/", "") + "/" + name)
CodePudding user response:
Here's a function that you could use to get all the files that match your pattern. Invoke with recursive=True if you need to examine sub-directories:
import glob
import os

def getBatchFiles(root, batch, recursive=False):
    pattern = f'*{batch}*.*'
    gp = os.path.join(root, '**', pattern) if recursive else os.path.join(root, pattern)
    return glob.glob(gp, recursive=recursive)
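For example, you could use it in place of the os.walk loop from the question. This is just a sketch that assumes the root_list and batch_no variables from your code and reproduces the "dbfs/" stripping:

files_list = []
for root in root_list:
    for p in getBatchFiles(root, batch_no, recursive=True):
        # glob returns the full path, so strip the prefix as in the original code
        files_list.append(str(batch_no) + p.replace("dbfs/", ""))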
CodePudding user response:
To answer the question of how to make it faster, you first need to answer the question "why is it slow?".
To answer that, you will need to profile your code.
You could use cProfile from the standard library, or an external package such as line_profiler.
Running your code through the profiler will show you which parts take the most time. Once you've identified those you can think of alternatives.
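For instance, a minimal cProfile run might look like this. It is only a sketch: list_batch_files is a hypothetical wrapper around your os.walk loop, not something from your code:

import cProfile
import pstats

# profile the listing code and dump the stats to a file
cProfile.run("list_batch_files(root_list, batch_no)", "listing.prof")

# show the 10 entries with the highest cumulative time
stats = pstats.Stats("listing.prof")
stats.sort_stats("cumulative").print_stats(10)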
I would hazard a guess that retrieving the names from the Azure cloud storage is what takes the most time. Maybe there is a different (faster) API you can use for that?
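If that turns out to be the bottleneck, one option is to list the container with the Data Lake SDK instead of walking a mounted path. Below is a minimal sketch, assuming the azure-storage-file-datalake package, and that you know your account URL, credential, and container (file system) name; the pattern filtering stays client-side with fnmatch:

from fnmatch import fnmatch
from azure.storage.filedatalake import DataLakeServiceClient

def list_batch_files_adls(account_url, credential, file_system, batch_no):
    # one recursive listing call per container instead of one directory
    # listing per folder, which is usually much faster over the network
    service = DataLakeServiceClient(account_url=account_url, credential=credential)
    fs_client = service.get_file_system_client(file_system)
    pattern = f"*{batch_no}*.*".upper()
    return [
        p.name
        for p in fs_client.get_paths(recursive=True)
        if not p.is_directory and fnmatch(p.name.upper(), pattern)
    ]

Whether this is actually faster depends on where the time goes in your profile, so measure first.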