I am working on a project where I am combining 300,000 small files to form a dataset for training a machine learning model. Each file holds a variable number of samples rather than a single sample, so the dataset can only be formed by iterating through every file and concatenating/appending its contents to a single, unified array. In other words, I cannot avoid iterating through all of these files, and as a result the data loading step before model training is very slow.
My question is this: would it be better to merge these small files into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
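To illustrate, here is a simplified sketch of the loading loop I mean (the directory name and the .npy format are just placeholders; assume each small file stores a NumPy array of samples):

```python
import numpy as np
from pathlib import Path

data_dir = Path("samples")  # placeholder: directory holding the ~300,000 small files

parts = []
for f in sorted(data_dir.glob("*.npy")):
    parts.append(np.load(f))  # each file contributes a variable number of rows

# Single unified array used for training
dataset = np.concatenate(parts, axis=0)
```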
Thanks!
CodePudding user response:
Usually working with one bigger file is faster than working with many small files.

It needs fewer open, read, close, etc. calls, and each of those calls takes time to:
- check whether the file exists,
- check whether you have permission to access the file,
- get the file's information from disk (where the file begins on disk, what its size is, etc.),
- seek to the beginning of the file on disk (when it has to read data),
- create the system's buffer for data from disk (the system reads extra data into the buffer so that later read() calls can be served partially from the buffer instead of partially from disk).

With many files the system has to do all of this for every single file, and disk access is much slower than reading from a buffer in memory.
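As a minimal sketch of how the consolidation could look, assuming each small file is a NumPy array saved as .npy (the directory names and shard size below are placeholders):

```python
import numpy as np
from pathlib import Path

SRC_DIR = Path("small_files")    # placeholder: directory with the ~300,000 small .npy files
DST_DIR = Path("merged_shards")  # placeholder: output directory for the merged files
FILES_PER_SHARD = 1000           # 300,000 / 1,000 -> ~300 merged files

DST_DIR.mkdir(exist_ok=True)
files = sorted(SRC_DIR.glob("*.npy"))

for start in range(0, len(files), FILES_PER_SHARD):
    chunk = files[start:start + FILES_PER_SHARD]
    # Each small file holds a variable number of samples; concatenate them once here
    # so the training loop only has to open one file per shard.
    arrays = [np.load(f) for f in chunk]
    merged = np.concatenate(arrays, axis=0)
    np.save(DST_DIR / f"shard_{start // FILES_PER_SHARD:04d}.npy", merged)
```

With roughly 1,000 source files per shard this produces about 300 merged files, so training pays the per-file open/read/close overhead ~300 times instead of ~300,000 times.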