I have a dataset of images that is too large to store in memory. What I plan to do is load pairs of image paths and their corresponding labels as my dataset, then use a generator function during training to convert only the paths in my batch to images before feeding them to the network.
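Roughly, this is what I have in mind (the paths, labels, image size and decoding details below are just placeholders):

```python
import tensorflow as tf

# Placeholder paths/labels; the real lists are built from my dataset on disk.
image_paths = ["images/img_0001.jpg", "images/img_0002.jpg"]
labels = [0, 1]

def load_image(path, label):
    # Turn a file path into an image tensor.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, label

ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
ds = ds.map(load_image).batch(32)
```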
Is data.Dataset.map() a good way to do this? Does it return a mapping function that can be applied only to the current batch during training, or does it perform the mapping operation on the whole dataset at once, occupying lots of memory? In the second case, what is an alternative?
A few tutorials I went through made me believe the mapping takes place per batch, but this quote from the documentation suggests a whole new dataset is returned: "This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input."
CodePudding user response:
The key thing to understand here is that tf.data.Dataset objects are generally "lazy", in that elements are only processed as needed (in a batched Dataset, elements == batches). When iterating over a dataset, this usually means that only the next requested element is prepared and then returned. So to answer your question: when using map to load data from disk, applied to a dataset of file names, only one batch of the loaded data should be stored in memory at the same time, and you should be able to process the dataset just fine. However, this can significantly slow down training if loading the files is a bottleneck in terms of speed.
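As a quick illustration (reusing the placeholder image_paths, labels and load_image from the sketch in your question, and assuming a reasonably recent TF 2.x for tf.data.AUTOTUNE): building the pipeline reads nothing from disk, and iterating over it materializes only one batch at a time. num_parallel_calls is optional, but it can help if file loading turns out to be the bottleneck.

```python
ds = (tf.data.Dataset.from_tensor_slices((image_paths, labels))
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)  # still lazy
      .batch(32))

# Nothing has been loaded yet. Files are read only while iterating,
# e.g. inside model.fit(ds), or explicitly:
for images, batch_labels in ds.take(1):
    pass  # only this one batch of decoded images is held in memory
```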
There are some exceptions though, for example:
- When you use the shuffle method, you need to provide a buffer size, and AFAIK the entire buffer is preprocessed at once. This can lead to issues since you want a large buffer for good shuffling, but this requires more memory. Thus you probably want to use shuffle before applying map (see the sketch after this list).
- The prefetch method results in multiple elements being prepared, in order to avoid the model having to wait for the next batch to be processed.
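One possible ordering looks like this (again reusing image_paths, labels and load_image from the earlier sketch; the buffer and batch sizes are made up):

```python
ds = (tf.data.Dataset.from_tensor_slices((image_paths, labels))
      .shuffle(buffer_size=10_000)  # buffer holds only paths/labels, not decoded images
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))  # prepare the next batch(es) while the model trains
```

Because shuffle comes before map here, its buffer only needs to hold lightweight path/label pairs rather than full images.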
Note that this lazy behavior also has some disadvantages, e.g.
- You can only iterate over datasets sequentially; there is no random access.
- A dataset doesn't even know how many elements it contains (this would require iterating over the entire set).
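A small illustration of the second point (this relies on Dataset.cardinality(), which to my knowledge is available in recent TF 2.x releases):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)

print(ds.cardinality())  # tf.data.UNKNOWN_CARDINALITY (-2): length is not known
n = sum(1 for _ in ds)   # counting requires a full sequential pass; n == 5
```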