Python: Find files in very large folder (over 100 TB)

Time:03-31

I am working on a program that compares entries in cataloguing software (Rucio) with the files in storage. From the catalogue, I get the path where it believes each file is stored. I then check that location to see whether the file exists there or not. I have successfully created a bash script that does this, but it would be a lot better if it could be redone in Python.

The problem I have encountered is that Python does not find the files, even when I know they exist. I have tried things like

if path.exists(fulladdress):
    ...  # do stuff

Even when I give it a file I know exists, it still does not find it. I suspect this is because the folder is huge (over 100 TB and more than 287,000 files), so Python does not search the whole folder and therefore misses the file.

Is there a Python solution that works for folders that big?

Best regards, Piotr

The bash command that works (called from Python through os.system) is:

os.system("cd; cd directory_with_files; test -e file_in_directory_exist && echo filename >> found.txt || echo filename >> not_found")

I tried running this:

    def findfile(name, path):
        for dirpath, dirname, filename in os.walk(path):
            if name in filename:
                return os.path.join(dirpath, name)
    
    def compere_checksum(not_missing_files):
        not_missing_files_file = open(not_missing_files, 'r')
        lines_not_missing_files_file = not_missing_files_file.readlines()
    
        # Extract the files I know exist
        for line in lines_not_missing_files_file:
            line.replace(' ','')
            line_list=line.split(",")
            address=line_list[0].replace("LUND: file://", "")
            #address= path to the folder 
            fille=address[address.rindex('/') + 1:]
            # fille: the name of the file
            address=address.replace(fille,"")

            #search for the file using bash
            os.system("test -e {} && echo Found {}".format(line_list[0],fille))
            
            # search for the file using the Python function above
            filepath=findfile(address,fille)
            print(filepath)

address is something along the lines of "/projects/dir/dir/dir/dir/dir/mc20/v12/4.0GeV/v2.2.1-3e/"

and fille looks like this: "mc_v12-4GeV-3e-inclusive_run1310195_t1601591250.root"

The script returns:

Found mc_v12-4GeV-3e-inclusive_run1310220_t1601591602.root
None
Found mc_v12-4GeV-3e-inclusive_run1310246_t1601592829.root
None
Found mc_v12-4GeV-3e-inclusive_run1310247_t1601591229.root
None
Found mc_v12-4GeV-3e-inclusive_run1310248_t1601591216.root
None
Found mc_v12-4GeV-3e-inclusive_run1310249_t1601591416.root
None
Found mc_v12-4GeV-3e-inclusive_run1310250_t1601591472.root
None

So the bash command finds each file, but the Python function does not.

UPDATE: Solved

with open(file) as f:

finds the file. I don't know why this works when the other approaches do not, but whatever.

CodePudding user response:

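import os

def findfile(name, path):
    """Walk `path` recursively and return the first full path to a file called `name`."""
    # os.walk yields (dirpath, dirnames, filenames) for every directory under path
    for dirpath, dirnames, filenames in os.walk(path):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None  # no match found

# Arguments: file name first, then the directory to search
filepath = findfile("file2.txt", "/")
print(filepath)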
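
Note the argument order: findfile takes the file name first and the directory second, whereas the code in the question calls findfile(address, fille), with the directory first. With the arguments swapped, os.walk is handed the file name as its root. By default os.walk silently ignores a root it cannot list, so the loop body never runs and the function returns None. Since the catalogue already gives the exact directory, walking the tree may not be needed at all. The following is only a sketch under that assumption; file_in_dir is a hypothetical helper name, and the temporary directory and file names are made up for the demonstration.

```python
import os
import tempfile

def findfile(name, path):
    # Same walking function as in the answer above
    for dirpath, dirnames, filenames in os.walk(path):
        if name in filenames:
            return os.path.join(dirpath, name)

def file_in_dir(directory, name):
    """Return the full path if `name` is a regular file in `directory`, else None."""
    candidate = os.path.join(directory, name)
    return candidate if os.path.isfile(candidate) else None

# Demonstration on a throwaway directory (names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.root"), "w").close()

    found = file_in_dir(tmp, "example.root")    # direct check, no walking
    missing = file_in_dir(tmp, "absent.root")   # file that does not exist

    # Swapped arguments: the walk root is a file, not a directory,
    # so os.walk yields nothing and findfile falls through to None.
    swapped = findfile(tmp, os.path.join(tmp, "example.root"))
```

A direct check touches one path instead of listing hundreds of thousands of directory entries, which matters on a folder of this size.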

CodePudding user response:

I can use:

with open(file) as f:
    ...  # do stuff

I don't know why this works and not

path.exists

or

def findfile(name, path):
    for dirpath, dirname, filename in os.walk(path):
        if name in filename:
            return os.path.join(dirpath, name) 

but whatever, as long as it works it is fine.
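
One possible explanation, offered as a guess rather than a diagnosis: os.path.exists returns False whenever the underlying os.stat call fails, whatever the reason. A trailing newline left by readlines(), a leftover prefix in the path, or a permission problem on a parent directory would all look like "file not found". Calling os.stat directly shows the actual error. A small sketch, where explain_missing is a hypothetical helper name and the file names are made up:

```python
import os
import tempfile

def explain_missing(raw_path):
    """Return None if the path can be stat'ed, else a short reason string."""
    path = raw_path.strip()  # drop the trailing newline that readlines() keeps
    try:
        os.stat(path)
    except OSError as exc:
        return f"{type(exc).__name__}: {exc.strerror} ({path!r})"
    return None

# Demonstration on a throwaway directory (names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "example.root")
    open(existing, "w").close()

    # A line as returned by readlines() still ends in "\n"
    raw_exists = os.path.exists(existing + "\n")
    ok = explain_missing(existing + "\n")
    reason = explain_missing(os.path.join(tmp, "absent.root"))
```

Here raw_exists is False although the file is there, while the stripped path passes the check. For a real failure, the reason string names the exception type, which helps tell a missing file from a permission error.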
