What's the performance difference between Python accessing multiple small .npy files vs one large one?

I am currently working on a repository that has millions of small .npy (NumPy) and .png image files. The code reads and writes many of these small files.

It seems to be extremely slow. I was wondering: if I merged all the smaller .npy files into one larger file, would the code run faster? If so, what would be the reason? Does it have something to do with disk I/O?

CodePudding user response:

Opening a file requires the operating system to fetch metadata from the storage device, typically with multiple requests to the file system. Reading a block of data requires yet another request, so each file costs several I/O requests. The files are usually stored at different locations on the device (locations that look essentially random in practice), and random accesses are known to be slow: storage devices can only sustain a limited number of I/O operations per second (IOPS). HDDs have a very low IOPS budget (e.g. ~75 IOPS for mainstream 7200 RPM drives). SSDs are much faster, but neither the hardware nor the software stack is yet optimized for issuing many small requests one after another, so multiple threads are often needed to reach good IOPS on them.
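To get a feel for the gap on your own machine, here is a minimal benchmark sketch (the `npy_bench` scratch directory, file count, and array shape are made up for illustration). It writes the same data both as many small .npy files and as one stacked array, then times reading it back each way:

```python
import time
from pathlib import Path

import numpy as np

tmp = Path("npy_bench")          # hypothetical scratch directory
tmp.mkdir(exist_ok=True)
n_files, shape = 1000, (64, 64)  # 1000 small arrays, ~16 KiB each

# Write the same data both as many small files and as one stacked file.
small = [np.random.rand(*shape).astype(np.float32) for _ in range(n_files)]
for i, a in enumerate(small):
    np.save(tmp / f"part_{i}.npy", a)
np.save(tmp / "merged.npy", np.stack(small))

# Time reading it back both ways. Right after writing, the OS page cache
# hides most of the per-file cost; drop caches (or use cold data) for
# numbers that reflect the storage device rather than RAM.
t0 = time.perf_counter()
many = [np.load(tmp / f"part_{i}.npy") for i in range(n_files)]
t1 = time.perf_counter()
one = np.load(tmp / "merged.npy")
t2 = time.perf_counter()

print(f"{n_files} small files: {t1 - t0:.3f}s, one merged file: {t2 - t1:.3f}s")
```

On a cold cache the per-file open/read overhead dominates the small-file case; on a warm cache the two can look deceptively close.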

Thus, yes, merging many files into one will certainly improve performance, especially if you use an HDD, because you can then read big contiguous chunks.
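Note that merging does not force you to load everything at once. If the merged file is a stacked array like the `merged.npy` from the sketch above, `np.load` with `mmap_mode="r"` memory-maps it, so slicing pulls only the contiguous chunk you ask for:

```python
import numpy as np

# Memory-map the merged file from the earlier sketch; nothing is read yet.
merged = np.load("npy_bench/merged.npy", mmap_mode="r")

# Materialize one contiguous chunk of 100 arrays with a single large read.
batch = np.asarray(merged[100:200])
print(batch.shape)  # (100, 64, 64)
```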

For more information about the expected performance, please read this.
