tensorflow custom loop training model, multi-gpu is slower than single-gpu

Time:12-07

TensorFlow 2.6; CUDA 11.2; 4 GPUs (RTX 3070)

When I define the training with Keras (model.fit), multiple GPUs accelerate training normally. However, with a custom training loop, keeping batch_size the same as for a single GPU (memory overflows if the multi-GPU batch size is set too large), training is slower on multiple GPUs than on a single GPU. I could not find a solution; any help is appreciated, thanks.

I have googled a lot but found no satisfactory solution. Any good ideas are very welcome, thanks very much!
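(Editor's note on the batch_size remark above: with tf.distribute.MirroredStrategy, the batch size given to the input dataset is the *global* batch, which is split evenly across replicas; if the global batch stays at the single-GPU value, each GPU only processes a fraction of a batch per step and per-step overhead dominates. A minimal arithmetic sketch, with illustrative numbers:)

```python
# Sketch: the dataset batch under MirroredStrategy is the GLOBAL batch;
# each of the N replicas receives 1/N of it per step.
num_gpus = 4                       # e.g. the 4x RTX 3070 from the question
per_replica_batch = 16             # what fits in one GPU's memory (illustrative)
global_batch = per_replica_batch * num_gpus  # pass this to dataset.batch()
print(global_batch)  # 64
```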

CodePudding user response:

There are many possibilities:

  1. Input dataset: the replicas run across device targets with different processing speeds, and the input distributor itself can become a bottleneck. TensorFlow supports synchronous and asynchronous execution modes; you can try tf.config.experimental.set_synchronous_execution(False).

  2. A custom training loop means exactly what the name says: you are responsible for driving the training process in your own program, rather than relying on model.fit() or an estimator.

  3. Input data and labels: as the examples show, with a custom loop you need to handle data input yourself, even when using an estimator.
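For point 1 and 3, a common pattern is to let the strategy distribute the input pipeline itself rather than feeding every replica the same batch. A minimal sketch (the dataset contents and batch size are placeholders, not from the question):

```python
import tensorflow as tf

# Sketch: shard the input pipeline across replicas with the strategy,
# instead of handing each replica the full global batch.
strategy = tf.distribute.MirroredStrategy()

global_batch = 8 * strategy.num_replicas_in_sync   # scale with replica count
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([64, 4]))  # dummy data
dataset = dataset.batch(global_batch).prefetch(tf.data.AUTOTUNE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

for batch in dist_dataset:
    pass  # each replica receives global_batch / num_replicas examples per step
```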


CodePudding user response:

@Jirayu Kaewprateep

I'm not using Keras model.fit to train this model, and my data generator works well. Here is a piece of my code.

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    # Model, optimizer and checkpoint must be created inside the strategy scope
    model = tf.keras.Model(input_data, bbox_tensors)
    optimizer = tf.keras.optimizers.Adam()
    ckpts = tf.train.Checkpoint(optimizer=optimizer, model=model)

def training(inputs):
    """Per-replica training step."""
    image_data, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(image_data, training=True)
        tloss = compute_loss(predictions, labels)
    gradients = tape.gradient(tloss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return tloss

@tf.function
def distributed_training(dataset_inputs):
    per_replica_losses = mirrored_strategy.run(training, args=(dataset_inputs,))
    return mirrored_strategy.reduce(tf.distribute.ReduceOp.SUM,
                                    per_replica_losses, axis=None)
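(Editor's note: one thing the loop above does not do is scale the per-replica loss. Since MirroredStrategy sums gradients across replicas, the usual recipe is to divide each replica's loss by the global batch size, e.g. with tf.nn.compute_average_loss. A minimal sketch with illustrative values, not the asker's compute_loss:)

```python
import tensorflow as tf

# Sketch: inside a distributed step, scale the per-example loss by the
# GLOBAL batch size so that summing gradients across replicas matches
# what a single GPU would compute on the full batch.
per_example_loss = tf.constant([1.0, 2.0, 3.0])  # illustrative values
global_batch_size = 6                            # e.g. 2 replicas x 3 examples
loss = tf.nn.compute_average_loss(per_example_loss,
                                  global_batch_size=global_batch_size)
print(float(loss))  # (1 + 2 + 3) / 6 = 1.0
```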