I am training PyTorch models on various datasets. The datasets up to this point have been images, so I can just read them on the fly when needed with cv2 or PIL, which is fast.
Now I am presented with a dataset of tensor objects of shape [400, 400, 8]. In the past I have tried to load these objects with PyTorch's and NumPy's built-in tensor reading operations, but these are generally much slower than reading images.
The objects are currently stored in compressed h5py files with ~800 objects per file. My plan was to save the objects individually in some format and then read them on the fly, but I am unsure which format would be fastest to read.
I would like to avoid keeping them all in memory as I believe the memory requirement would be too high.
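For context, this is roughly the plan, sketched with .npy as a stand-in format (the dataset key "samples", the paths, and the NpyDataset class are placeholders, not my actual layout):

import os
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

# One-off export: unpack each [400, 400, 8] object from a compressed .h5 file
# into its own .npy file so items can be read individually later.
os.makedirs("items", exist_ok=True)
with h5py.File("data.h5", "r") as f:
    samples = f["samples"]                         # assumed shape (~800, 400, 400, 8)
    for i in range(samples.shape[0]):
        np.save(f"items/{i:05d}.npy", samples[i])

# Read the files on the fly instead of keeping everything in memory.
class NpyDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        return torch.from_numpy(np.load(self.paths[idx]))  # [400, 400, 8]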
CodePudding user response:
If the data arrays are still "images", just 8-channel ones, you can split each one into three 3-channel slices
a = x[:, :, 0:3]          # channels 0-2
b = x[:, :, 3:6]          # channels 3-5
c = x[:, :, 5:8].copy()   # channels 5-7; copy so channel 5 of x (and b) is not overwritten
c[:, :, 0] = 0            # blank the duplicated channel 5, which reduces the compressed size
and store them using the conventional image libraries (cv2 and PIL).
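A minimal sketch of the write side, assuming the arrays are already uint8 so they can be stored losslessly as PNG (the function name and file paths are just placeholders):

import cv2
import numpy as np

def save_as_images(x, stem):
    # x: uint8 array of shape [400, 400, 8]
    a = np.ascontiguousarray(x[:, :, 0:3])  # channels 0-2
    b = np.ascontiguousarray(x[:, :, 3:6])  # channels 3-5
    c = np.ascontiguousarray(x[:, :, 5:8])  # channels 5-7 (contiguous copy, so x is untouched)
    c[:, :, 0] = 0                          # blank the duplicated channel 5
    cv2.imwrite(f"{stem}_a.png", a)         # PNG is lossless; use JPEG only if lossy is acceptable
    cv2.imwrite(f"{stem}_b.png", b)
    cv2.imwrite(f"{stem}_c.png", c)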
Images compress much better than general data (lossy JPEG even more so), and therefore this reduces both disk space and bandwidth, and has file-system caching benefits.
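Reading back and reassembling the 8 channels would then look something like this; as long as the same library is used for both writing and reading, cv2's BGR channel convention cancels out:

def load_from_images(stem):
    a = cv2.imread(f"{stem}_a.png", cv2.IMREAD_UNCHANGED)  # channels 0-2
    b = cv2.imread(f"{stem}_b.png", cv2.IMREAD_UNCHANGED)  # channels 3-5
    c = cv2.imread(f"{stem}_c.png", cv2.IMREAD_UNCHANGED)  # zeroed channel + channels 6-7
    return np.concatenate([a, b, c[:, :, 1:]], axis=2)     # back to [400, 400, 8]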