I have seen tutorials that call .repeat() while loading, shuffling, mapping, batching, and prefetching a TensorFlow dataset, while others skip it completely.
I know what repeat does and how to use it, but I cannot figure out when it is needed and when it is not.
Any help?
CodePudding user response:
It depends. Let's use MNIST as an example. Say we build a dataset using from_tensor_slices. The training dataset has 60000 samples.
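For instance, a minimal sketch (the variable names are illustrative, not taken from any particular tutorial):

import tensorflow as tf

# Load MNIST and wrap the training arrays in a tf.data.Dataset.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))

print(train_ds.cardinality().numpy())  # 60000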
Let's say we use a batch size of 100 and do not use repeat. This means the dataset will provide 600 batches. Now, if we try to train a model, for example through the Keras fit interface, the dataset will simply run out of samples after 600 steps! We will not be able to train for longer than that. With repeat, the dataset instead simply "starts fresh" once it runs out, and we can train as long as we like.
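A hedged sketch of the repeat-with-fit pattern (the tiny model here is arbitrary, just enough to make fit runnable):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# repeat() makes the dataset restart whenever it is exhausted, so it is
# effectively infinite; fit then needs steps_per_epoch to know where one
# epoch ends. 600 steps of batch size 100 correspond to one full pass.
train_batches = train_ds.repeat().batch(100)
model.fit(train_batches, epochs=5, steps_per_epoch=600)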
Other tutorials might use a manual training loop. Perhaps you have a loop like

for batch in data_set:
    ...
In this example, once again, the loop will simply stop after 600 batches if we do not use repeat. However, we can do this:
for epoch in range(n_epochs):
    for batch in data_set:
        ...
In this example, we specify the number of passes over the dataset in n_epochs. The inner loop stops after 600 batches, but then the outer loop increments epoch by 1, and the inner loop starts again. This way, we can run more than 600 batches even without using repeat.
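A small runnable illustration of both behaviors (the dataset here is just a range of integers standing in for real samples):

import tensorflow as tf

# 60000 dummy samples in 600 batches of 100.
data_set = tf.data.Dataset.from_tensor_slices(tf.range(60000)).batch(100)

# A single pass ends after 600 batches.
print(sum(1 for _ in data_set))  # 600

# An outer epoch loop re-iterates the same finite dataset; no repeat needed.
n_epochs = 3
total_batches = 0
for epoch in range(n_epochs):
    for batch in data_set:
        total_batches += 1  # a real loop would run a training step here
print(total_batches)  # 1800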
Finally, there are of course other ways to create datasets. For example, from_generator can be used to stream a dataset from a Python generator that can run indefinitely, so repeat is not necessary at all.
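For example, a sketch along these lines (the generator and its output signature are made up for illustration):

import itertools
import tensorflow as tf

def sample_stream():
    # itertools.count never stops, so this generator yields forever.
    for i in itertools.count():
        yield i % 10

ds = tf.data.Dataset.from_generator(
    sample_stream,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

# The dataset never runs out on its own; iteration has to be cut off
# explicitly, e.g. with take(), and repeat would be redundant here.
for x in ds.take(3):
    print(x.numpy())  # 0, 1, 2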
Without having seen the tutorials you are referring to, I can only guess that the differences in the use of repeat come down to differences in how the training loop is coded, such as the ones above.