MIDV 500 document localization: fitting problem

I've been experimenting with a part of the MIDV 500 dataset, trying to localize the document quadrilateral. So, my output is a vector of 8 floats.

RGB images were scaled to 960 by 540 pixels (960, 540, 3), and pixel values were scaled to [0..1]. The target vector was also scaled to [0..1] (simply divided by the image dims).
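The target scaling described above can be sketched as follows. This is only an illustration: the helper names normalize_quad and denormalize_quad, the corner ordering, and the example coordinates are assumptions, not taken from the actual pipeline.

```python
import numpy as np

def normalize_quad(quad_px, width, height):
    """Scale 4 (x, y) corners given in pixels to [0, 1] and flatten to 8 floats."""
    quad = np.asarray(quad_px, dtype=np.float32).reshape(4, 2)
    scale = np.array([width, height], dtype=np.float32)
    return (quad / scale).reshape(-1)

def denormalize_quad(vec, width, height):
    """Inverse of normalize_quad: 8 floats in [0, 1] back to a (4, 2) pixel array."""
    quad = np.asarray(vec, dtype=np.float32).reshape(4, 2)
    return quad * np.array([width, height], dtype=np.float32)

# Hypothetical corners (x, y) in pixels; width and height are example values.
corners_px = [(96, 54), (864, 54), (864, 486), (96, 486)]
target = normalize_quad(corners_px, width=960, height=540)
restored = denormalize_quad(target, width=960, height=540)
```

Keeping the inverse function next to the forward one makes it easy to draw predictions back onto the original image and confirm that x is divided by the width and y by the height, not the other way around.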

My first approach was a pretrained CNN (fine-tuning) from Keras applications (tried EfficientNetB0 through B2) with the following Dense head:

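from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Nadam

# iou_loss and iou_metric are custom Jaccard (IoU) functions defined elsewhere

effnet = EfficientNetB0(weights="imagenet", include_top=False, input_shape=(540, 960, 3))
# effnet.trainable = False

# Freeze everything except block7a and the top layers
for layer in effnet.layers:
    if 'block7a' not in layer.name and 'top' not in layer.name:
        layer.trainable = False

model = Sequential()
model.add(effnet)  # pretrained backbone

# Dense regression head
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(8, activation='sigmoid'))  # 8 outputs in [0, 1]

opt = Nadam(learning_rate=0.001)
model.compile(metrics=[iou_metric], loss=iou_loss, optimizer=opt)

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.6, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=25)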

I added a few callbacks, and used Nadam as the optimizer and the Jaccard score as both loss and metric.

Loss graph

As we can see from the loss graph, the training loss decreases throughout training. The validation loss, though, is acting strange.

After ~30 epochs, model.predict() returns the same vector for every sample ([0., 0., 1., 1., 0., 0., 0., 1.] repeated for each row of val_x). I am not sure whether it's overfitting or underfitting (it seems like the wrong approach, though).

So, would you be so kind as to tell me what I am doing wrong? I've tried a few different loss functions and double-checked my data before and after scaling to [0..1]. I will try something like UNet for a segmentation approach, though localization seems like the right approach.

CodePudding user response：

Two things:

Please check which version of TensorFlow (TF) you are using. I believe that from 2.5 on, you don't need to rescale the input image to the range [0-1]: the network expects tensors with values in [0-255]. https://keras.io/api/applications/efficientnet/
Your model architecture and callbacks look all right (I am not an expert on this optimizer or loss, though). So I am assuming the problem might come from your data input. Are you using ImageDataGenerator to feed the data and to split it into training and validation sets? If not, it might be worth a try: you can specify the validation subset and the generator will split the data for you. More info here: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator
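A quick way to act on the first point is to check the pixel range of the arrays before they reach the network. The sketch below uses only NumPy; the helper name to_0_255 and the "max <= 1.0" heuristic are assumptions for illustration, and the heuristic would misfire on an entirely black batch.

```python
import numpy as np

def to_0_255(images):
    """Return float32 images in [0, 255], rescaling if they look like [0, 1] data.

    Heuristic (assumption): a batch whose maximum is <= 1.0 was divided by 255 earlier.
    """
    images = np.asarray(images, dtype=np.float32)
    if images.max() <= 1.0:
        images = images * 255.0
    return images

# Hypothetical batches: one already divided by 255, one left in raw pixel values.
scaled_batch = np.full((2, 4, 4, 3), 0.5, dtype=np.float32)
raw_batch = np.full((2, 4, 4, 3), 200.0, dtype=np.float32)

fixed = to_0_255(scaled_batch)      # rescaled back to the [0, 255] range
untouched = to_0_255(raw_batch)     # left as-is
```

Only the images need this treatment; the target vector can stay in [0..1] to match the sigmoid output.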

CodePudding user response：

Finally, a healthy gradient descent. So, the solution:

As @Nicolas Perez de Olaguer mentioned before, I should have used non-scaled image data (regular RGB images with 3 channels and pixel values in [0..255]).
I've been trying to solve a localization problem, which is quadrilateral regression, where you predict the coordinates of the quadrilateral's corners (different from bbox regression, where you predict a vector of 4). I decided to use a modern IoU loss (Jaccard loss), which is typically used for bbox regression and expects a vector of 4 as input. My vectors had 8 elements, so the loss function wasn't calculating correctly for this problem.
Also, my model had a few design flaws that you can learn from. First, I added too many dense layers; 1 or 2 is enough (more will just slow down training). Second, I selected too large an image resolution, which again only slows down training (localizing a big white quadrilateral is too easy a problem to need a resolution that big). And, finally, I hadn't thought about the ImageNet weights, which may sometimes decrease training quality on highly atypical data (if you are trying to use a pretrained CNN on objects it has never seen).
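To illustrate the loss mismatch from the second point, here is a small NumPy sketch. The original iou_loss is not shown in the thread, so this assumes a typical bbox IoU that reads (x_min, y_min, x_max, y_max); the helper names bbox_iou, quad_to_bbox and corner_l1 are hypothetical, and the corner-wise L1 error is only one possible replacement, not necessarily the one that was used in the end.

```python
import numpy as np

def bbox_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def quad_to_bbox(quad):
    """Axis-aligned bounding box of a quadrilateral stored as 8 floats (x1, y1, ..., x4, y4)."""
    pts = np.asarray(quad, dtype=np.float64).reshape(4, 2)
    return np.array([pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()])

def corner_l1(q_true, q_pred):
    """Mean absolute error over the 8 corner coordinates."""
    return float(np.mean(np.abs(np.asarray(q_true) - np.asarray(q_pred))))

# Hypothetical ground-truth quadrilateral: (x1, y1, x2, y2, x3, y3, x4, y4)
quad = np.array([0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9])

# Reading the first 4 numbers as a box gives y_min == y_max, so the area is zero
# and even a perfect prediction scores an IoU of 0.
naive_iou = bbox_iou(quad[:4], quad[:4])

# Converting to a real bounding box first, or regressing the corners directly,
# behaves as expected for a perfect prediction.
proper_iou = bbox_iou(quad_to_bbox(quad), quad_to_bbox(quad))
perfect_l1 = corner_l1(quad, quad)
```

With a corner layout like this one, the first four values describe two corners of the same edge rather than opposite corners of a box, which is why a bbox-style IoU cannot be applied to the 8-element vector as-is.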

Thanks for your attention and help. I'll mark this answer as solution.