apply dataset.repeat to a prefetched dataset


I'm trying to implement the AdaRound quantization algorithm, which requires training my layers one by one.

I'm using a dataset of 1024 samples with a batch size of 32, and I need to iterate over it for roughly 312 epochs (at 32 steps per epoch, that's about 10k iterations over the batched dataset). I've noticed that the data is copied from the host to the device on every iteration and is not cached on the GPU (despite the same data being used repeatedly) - the GPU is idle 30-40% of the time:

[Profiler screenshot: idle GPU percentage]

The data is still copied from the host to the device in later iterations:

[Profiler screenshot: memcpyH2D chunk in a single iteration]

I've tried using:

  • tf.data.experimental.prefetch_to_device
  • tf.data.experimental.copy_to_device

When I iterate over the data after prefetch_to_device or copy_to_device, the tensors are stored on the GPU, but as soon as I add repeat to go over the dataset again, the tensors are stored back on the CPU.
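Worth noting (my addition, based on the TensorFlow documentation rather than on anything in the original post): prefetch_to_device is documented to take effect only when it is the final transformation in the input pipeline, so repeat would have to come before it. A minimal sketch, reusing input_data and output_data from the snippet below:

import tensorflow as tf

# Sketch: repeat() first, then stage batches to the GPU as the LAST
# transformation. Whether this removes the per-step memcpyH2D copy may
# depend on the TF version.
dataset = (
    tf.data.Dataset.from_tensor_slices((input_data, output_data))
    .batch(32)
    .cache()
    .repeat()
    .apply(tf.data.experimental.prefetch_to_device('/gpu:0'))
)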

I tried using model.fit without dataset.repeat but with multiple epochs, and I see similar behavior.

I also tried calling model.fit with tensors that are already stored on the GPU, but Model.fit converts them to a Dataset, which forces the data back to the CPU.
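For completeness, a hedged workaround sketch (my assumption, not something attempted in the original post): a custom training loop bypasses Model.fit's Dataset conversion entirely by indexing GPU-resident tensors directly. train_model, input_data, and output_data refer to the snippet below:

# Sketch: keep the data on the GPU (assuming both arrays fit in GPU memory)
# and drive training with a tf.function step instead of Model.fit.
with tf.device('/gpu:0'):
    gpu_x = tf.constant(input_data, dtype=tf.float32)
    gpu_y = tf.constant(output_data, dtype=tf.float32)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = train_model(x, training=True)
        loss = train_model.compiled_loss(y, pred)
    grads = tape.gradient(loss, train_model.trainable_variables)
    train_model.optimizer.apply_gradients(zip(grads, train_model.trainable_variables))
    return loss

for step in range(10000):
    start = (step * 32) % 1024          # cycle over the 1024 samples
    train_step(gpu_x[start:start + 32], gpu_y[start:start + 32])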

A code snippet to recreate the issue:

import numpy as np
import tensorflow as tf

# A single conv layer trained in isolation, as in AdaRound's per-layer training
input_shape = (56, 56, 64)
output_shape = (54, 54, 64)
conv = tf.keras.layers.Conv2D(64, (3, 3))
mock_input = tf.keras.layers.Input(input_shape)
mock_output = conv(mock_input)
train_model = tf.keras.Model(inputs=mock_input, outputs=mock_output)

input_data = np.random.rand(1024, *input_shape)
output_data = np.random.rand(1024, *output_shape)

input_dataset = tf.data.Dataset.from_tensor_slices(input_data)
output_dataset = tf.data.Dataset.from_tensor_slices(output_data)

train_model.compile(
    optimizer='adam',
    loss='mse'
)
train_data = tf.data.Dataset.zip((input_dataset, output_dataset))
batched_train_data = train_data.batch(32).cache()
fetched_train_data = batched_train_data.prefetch(tf.data.AUTOTUNE).repeat()
with tf.profiler.experimental.Profile('logs'):
    train_model.fit(fetched_train_data, steps_per_epoch=1024, epochs=1)

Is there a way to apply the dataset.repeat operation on the GPU?

  • I'm using TensorFlow 2.5.2 with Python 3.6.9

CodePudding user response:

Detailed answer

NVIDIA has a package named NVIDIA DALI. This package offers an efficient wrapper around TensorFlow's Dataset (among other things, but that is the relevant feature I used here). I had to install two packages, nvidia-dali-cuda110 and nvidia-dali-tf-plugin-cuda110 (a detailed installation guide can be found here).

The class I used is called DALIDataset; to instantiate it properly, I first had to initialize a pipeline object.

[Profiler screenshot: a single iteration with a properly initialized DALIDataset]

Code snippet:

import tensorflow as tf
from nvidia.dali.plugin.tf import DALIDataset
from nvidia.dali import pipeline_def, fn

def prep_dataset_dali(dir1, dir2, batch_size):
    # Read the .npy files on the CPU, then move the batches to the GPU inside DALI
    @pipeline_def(batch_size=batch_size, num_threads=3, device_id=0)
    def pipe(path1, path2):
        data1 = fn.readers.numpy(device='cpu', file_root=path1, file_filter='*.npy')
        data2 = fn.readers.numpy(device='cpu', file_root=path2, file_filter='*.npy')
        return data1.gpu(), data2.gpu()

    my_pipe = pipe(dir1, dir2)
    my_pipe.build()
    # Output shapes match the question's data: (56, 56, 64) inputs, (54, 54, 64) targets
    return DALIDataset(my_pipe,
                       output_dtypes=(tf.float32, tf.float32),
                       output_shapes=((None, 56, 56, 64), (None, 54, 54, 64)))

Note:

a pipeline that uses an external source doesn't work with DALIDataset, but it might work with the DALIDatasetWithInputs class from the experimental section
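A minimal usage sketch (the directory paths here are hypothetical placeholders, not from the original post). The DALI reader cycles through the files, so the resulting dataset can be treated as repeating and steps_per_epoch is required:

# 'data/inputs' and 'data/targets' are hypothetical paths holding the .npy
# files; train_model is the compiled model from the question's snippet
dali_train_data = prep_dataset_dali('data/inputs', 'data/targets', batch_size=32)
train_model.fit(dali_train_data, steps_per_epoch=1024, epochs=1)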
