Getting error "Resource exhausted: OOM when allocating tensor with shape[1800,1024,28,28] and t


I am getting a resource exhausted error when initiating training for my TensorFlow 2.5 GPU object detection model. I am using 18 training images and 3 test images. The pre-trained model I am using is the Faster R-CNN ResNet101 V1 640x640 model from the TensorFlow 2.2 model zoo. I am training on an Nvidia RTX 2070 with 8 GB of dedicated memory.

The thing I am confused about is why the training process is taking up so much memory from my GPU when the training set is so small. This is the summary of GPU memory I get along with the error:

Limit:                      6269894656
InUse:                      6103403264
MaxInUse:                   6154866944
NumAllocs:                        4276
MaxAllocSize:               5786902272
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

I also decreased the batch size of the training data to 6, and of the testing data to 1.

CodePudding user response:

I use the code below in all notebooks where I run on a GPU, to prevent this type of error:

    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
      try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
          tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
      except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process.
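If memory growth alone does not help, the same guide also describes putting a hard cap on how much memory TensorFlow may claim. A minimal sketch, assuming TF 2.4+ where the non-experimental logical-device API is available (the 6144 MB limit is just an example value, not something taken from your setup):

    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
      try:
        # Cap TensorFlow at roughly 6 GB on the first GPU; the rest of the
        # 8 GB stays free for the display driver and other processes.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=6144)])
      except RuntimeError as e:
        # Logical devices must be configured before the runtime is initialized
        print(e)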

There is more information about using a GPU with TensorFlow here: https://www.tensorflow.org/guide/gpu

Maybe it will solve the error. I hope this helps.

CodePudding user response:

Max memory usage during training is impacted by several factors and reducing the batch size is typically how to address memory constraints. Alexandre Leobons Souza's recommendation may help as well by giving Tensorflow more flexibility in allocating memory, but if you continue to see OOM errors, then I would recommend reducing batch size further. Alternatively, you could try limiting the trainable variables in the model, which will also result in lower memory usage during training.
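For the TF2 Object Detection API, the batch size lives in the pipeline.config that ships with the zoo model rather than in your Python code. A minimal sketch of lowering it with config_util, assuming the usual object_detection installation and a hypothetical path to the model's config:

    from object_detection.utils import config_util

    # Hypothetical path to the config that ships with the zoo model.
    pipeline_path = 'faster_rcnn_resnet101_v1_640x640/pipeline.config'

    configs = config_util.get_configs_from_pipeline_file(pipeline_path)
    configs['train_config'].batch_size = 2   # drop from 6; try 1 if OOM persists

    # Write the modified config back next to the checkpoint.
    pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
    config_util.save_pipeline_config(pipeline_proto, 'faster_rcnn_resnet101_v1_640x640')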

You mentioned, "The thing I am confused about is why the training process is taking up so much memory from my GPU when the training set is so small." Something to keep in mind is that during training, your training data is used in a forward pass through the model, and then gradients are calculated for each trainable variable in a backward pass. Even if your training data is small, the intermediate calculations (including the gradients) require memory. These calculations scale linearly with your batch size and with the model size. By reducing the batch size or the number of trainable variables, training will require less memory.
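As a rough illustration of the trainable-variables point, here is a generic Keras sketch with a stand-in ResNet50 backbone (not the actual Faster R-CNN checkpoint; the Object Detection API exposes similar freezing options through its train_config):

    import tensorflow as tf

    # Stand-in backbone just to show the effect; weights=None avoids a download.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_shape=(640, 640, 3))

    # Freeze all but the last few layers: frozen variables get no gradients,
    # so their gradient tensors and optimizer slots are never allocated.
    for layer in backbone.layers[:-10]:
        layer.trainable = False

    n_trainable = sum(int(tf.size(v)) for v in backbone.trainable_variables)
    n_frozen = sum(int(tf.size(v)) for v in backbone.non_trainable_variables)
    print(f'trainable params: {n_trainable:,}  frozen params: {n_frozen:,}')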

One other suggestion: if the shape of your input tensor changes across training examples (e.g. if the number of ground truth bounding boxes goes from 1 to 2 and you are not padding the input tensor), this can cause Tensorflow to retrace the computation graph during training and you will see warnings. I'm not certain of the impact on memory in this case, but I suspect that each retrace effectively requires a duplicate model in memory. If this is the case, you can try using @tf.function(experimental_relax_shapes=True).
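To make the retracing point concrete, here is a minimal sketch (not your training loop) that mimics a varying number of ground-truth boxes; with experimental_relax_shapes=True TensorFlow relaxes to a generic-shape trace instead of building a new concrete function for every box count (in newer versions the option is called reduce_retracing):

    import tensorflow as tf

    trace_count = 0

    @tf.function(experimental_relax_shapes=True)
    def forward(boxes):
        global trace_count
        trace_count += 1          # Python side effect: runs only when (re)tracing
        return tf.reduce_sum(boxes)

    # Simulate a varying number of ground-truth boxes (1, 2, 3 boxes of 4 coords).
    for n in range(1, 4):
        forward(tf.zeros([n, 4]))

    print('traces:', trace_count)  # fewer traces than calls once shapes are relaxed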
