I am working with the Places205 (resized) dataset, a classification dataset with around 2.7 million images totaling 131 GB.
I am trying to upload this dataset to Hub—the dataset format for AI—and the upload runs at around 5 MB/s. After uploading, I was able to load the dataset, but only around 2.4 million images were there.
Is it possible to make the uploading process faster?
I'm using Hub v2.2.2
CodePudding user response:
The speeds you are experiencing are consistent with expectations. Generally, Hub uploads datasets at ~10-15 MB/s single-threaded, so your ~5 MB/s is roughly in the same ballpark. If you want to upload in parallel, check out the methodology in the upload_parallel function in the script found on the Places205 GitHub example page, which uses multiprocessing to speed things up.
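The pattern behind that script is a standard multiprocessing fan-out. Here is a minimal sketch, assuming a hypothetical per-file `upload_file` helper (the real Places205 script's upload step will differ); the point is that several worker processes keep the network link busy instead of one thread waiting on each request:

```python
from multiprocessing import Pool

def upload_file(path):
    # Hypothetical per-file upload step. In the real script this would
    # read the image at `path` and append it to the Hub dataset; here it
    # just returns the path so the sketch is self-contained.
    return path

def upload_parallel(paths, workers=8):
    # Fan the per-file uploads out across `workers` processes so multiple
    # uploads are in flight at once.
    with Pool(processes=workers) as pool:
        return pool.map(upload_file, paths)

if __name__ == "__main__":
    results = upload_parallel([f"img_{i}.jpg" for i in range(4)], workers=2)
    print(results)
```

Throughput usually scales with the number of workers until your network bandwidth saturates, so it is worth benchmarking a few worker counts on a small slice of the dataset first.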
Btw, with Hub you can load the Places205 dataset with one line of code: ds = hub.load("hub://activeloop/places205"). You are also able to visualize Places205 on Activeloop Platform.