Lazily load files at random from large directory


I have about a million files in my directory, and their number is likely to grow. For machine learning, I would like to randomly sample from those files without replacement. How can I do this very quickly? os.listdir(path) is too slow for me.

CodePudding user response:

I have about a million files in my directory ... os.listdir(path) is too slow for me.

This is the core of your problem, and it's solved by a technique I've generally heard called bucketing your files, though a web search for that term doesn't turn up much that's helpful.

Bucketing is generally used by programs that need to store a large number of files without any particular structure - for example, all the media files (such as images) in a MediaWiki instance (the software that runs Wikipedia). Here's the Stack Overflow logo as hosted on Wikimedia Commons:

https://upload.wikimedia.org/wikipedia/commons/0/02/Stack_Overflow_logo.svg

See that 0/02 in the URL? That's the bucket. Every file in Wikipedia is hashed by some algorithm - for example sha256, though it won't necessarily be this - and 02 is the first two hex digits of that hash. (The 0 before the slash is just the first digit of 02; here it serves as an additional level of bucketing.)

If MediaWiki just stored every single file in one massive directory, accessing files in that directory would be very slow, because although OS folders can hold arbitrarily many files, they aren't designed to hold more than a few thousand or so. By hashing the contents of a file, you get what looks like a random string of hex digits, effectively unique to that file. If you then put all the files whose hashes start with the same two hex digits (like 02) into a folder called 02, you get 256 folders (one for each possible value of the first two hex digits), and critically, each of those 256 folders contains a roughly equal number of files.
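
As a concrete illustration of how such a path could be derived, here's a small Python sketch; the filename is made up, and sha256 is just the example algorithm from above:

    import hashlib

    # Hash the file's contents and take the leading hex digits.
    with open("some_file.svg", "rb") as f:       # illustrative filename
        digest = hashlib.sha256(f.read()).hexdigest()

    bucket_path = f"{digest[0]}/{digest[:2]}"    # e.g. '0/02'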

When you're trying to look up particular files, as MediaWiki is, you obviously need to know the hash to get to a file stored this way. But in your case, you just want to load random files, so this will work just as well:

  • Hash all your files and bucket them (possibly with additional levels, e.g. you might want files like 12/34/filename.ext, so that you have 65,536 buckets). You can use hashlib or command-line tools like sha256sum to obtain file hashes. You don't need to rename the files, as long as you group them into directories based on the first few hex digits of their hashes (see the first sketch after this list).
  • Now, each time you want a random file, choose a random bucket (and possibly random sub-buckets, if you're using additional levels), then choose a random file within that bucket (see the second sketch below).
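
Here's a minimal sketch of that first step, assuming a single 00..ff level (one bucket per leading byte of the hash); the names bucket_name and bucket_directory are mine, not a standard API:

    import hashlib
    import os
    import shutil

    def bucket_name(path):
        # First two hex digits of the sha256 of the file's contents.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()[:2]

    def bucket_directory(src, dst):
        # Move every file in the flat src directory into one of 256
        # bucket subdirectories of dst (named 00..ff). os.scandir
        # iterates lazily, so it handles a huge directory better than
        # building the whole list with os.listdir.
        with os.scandir(src) as it:
            for entry in it:
                if entry.is_file():
                    bucket = os.path.join(dst, bucket_name(entry.path))
                    os.makedirs(bucket, exist_ok=True)
                    shutil.move(entry.path, os.path.join(bucket, entry.name))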

Doing that will be a lot faster than using listdir on a directory with a million files and then choosing randomly among those.
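
And a sketch of the second step, under the same single-level assumption. Two caveats: this samples with replacement, and because the buckets are only roughly equal in size the selection is only approximately uniform. For sampling without replacement, you could instead shuffle each bucket's listing once and consume it.

    import os
    import random

    def random_file(root):
        # Pick a random bucket, then a random file within it. listdir
        # is cheap now: the root has 256 entries, and each bucket holds
        # only about 1/256th of the files.
        bucket = os.path.join(root, random.choice(os.listdir(root)))
        return os.path.join(bucket, random.choice(os.listdir(bucket)))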


Note: I'm just using MediaWiki as an example here because I'm familiar with a few of its internals; lots of software products do similar things.
