I have a pandas DataFrame with file names that need to be matched against the contents of a directory tree.
I've been using the following, but it crashes on larger directory structures. I record whether or not each file is present in two lists.
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
I've read that os.scandir is quicker and will handle larger directory trees. If true, how might this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
This runs (fast), but everything ends up in the "missed" list.
CodePudding user response:
Scan your directory tree only once and convert the result to a DataFrame. (As an aside, your os.scandir attempt puts everything in "missed" because os.scandir is not recursive: it only lists the top level of DIRECTORY_TREE. Also, DirEntry.name is an attribute, not a method, and in both versions the else branch appends to missed once per non-matching entry, which inflates the counts.)
Example on my venv directory:
import pandas as pd
import pathlib

DIRECTORY_TREE = pathlib.Path('./venv').resolve()

# One pass over the whole tree: record (parent directory, file name) for every file
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])

df_files = pd.DataFrame({'Filename': ['__init__.py']})
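To answer the scandir question directly: os.walk has been built on top of os.scandir since Python 3.5, so a single os.walk pass is usually just as fast as the pathlib version above. If you want to call os.scandir yourself, you have to make it recursive by hand. A minimal sketch (scan_tree is a hypothetical helper, and DIRECTORY_TREE is the path resolved above):

import os

def scan_tree(root):
    """Recursively yield (directory, filename) pairs via os.scandir."""
    stack = [str(root)]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)   # visit subdirectory later
                elif entry.is_file():
                    yield current, entry.name  # note: .name, not .name()

data = list(scan_tree(DIRECTORY_TREE))
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])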
Now you can use df_path to look up filenames from df_files with merge:
out = (df_files.merge(df_path, on='Filename', how='left')
               .groupby('Filename')['Directory'].count()  # counts real matches; a filename absent from the tree gets 0
               .to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']  # here 'Missed' = all other files in the tree
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
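If you also want your original lists back (full paths for hits, bare filenames for misses), the same merge gives you both. A hedged sketch, assuming the df_files and df_path frames built above:

import os

merged = df_files.merge(df_path, on='Filename', how='left')

# A NaN Directory means the filename was never seen anywhere in the tree.
hits = merged.dropna(subset=['Directory'])
found = [os.path.join(d, f) for d, f in zip(hits['Directory'], hits['Filename'])]
missed = merged.loc[merged['Directory'].isna(), 'Filename'].tolist()

print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)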