I am working with the OpenVINO Model Zoo, which has lots of pre-trained models. How does one post-process the data after inference when the model has two outputs, like this model? I am only looking to draw bounding boxes around people in the video or picture frame.
Output 1:
The boxes is a blob with the shape 100, 5 in the format N, 5, where N is the number of detected bounding boxes. For each detection, the description has the format: [x_min, y_min, x_max, y_max, conf]
Output 2:
The labels is a blob with the shape 100 in the format N, where N is the number of detected bounding boxes. In case of person detection, it is equal to 1 for each detected box with person in it and 0 for the background.
I am confused about whether I need to use one of the model outputs, or both, to come up with the bounding-box coordinates of the people detected.
Below is a function from the OpenVINO notebooks for post-processing people-detection data, but I know it doesn't accommodate a model that has two outputs. Any tips and/or pseudocode greatly appreciated.
import cv2

def postprocess(result, image):
    """
    Post-process the inference output and draw bounding boxes.
    :param result: the inference results (the boxes blob, shape N x 5)
    :param image: the original input frame
    :returns: the image with bounding boxes and confidence labels drawn on it
    """
    detections = result.reshape(-1, 5)
    for detection in detections:
        xmin, ymin, xmax, ymax, confidence = detection
        if confidence > 0.2:
            # Scale to pixel coordinates, keeping a 10-px margin from the edges
            xmin = int(max(xmin * image.shape[1], 10))
            ymin = int(max(ymin * image.shape[0], 10))
            xmax = int(min(xmax * image.shape[1], image.shape[1] - 10))
            ymax = int(min(ymax * image.shape[0], image.shape[0] - 10))
            conf = round(confidence, 2)
            print(f"conf: {conf:.2f}")
            print((xmin, ymin), (xmax, ymax))
            # Bounding box
            cv2.rectangle(image, (xmin, ymin),
                          (xmax, ymax), (255, 255, 255), 5)
            # Find the space required for the text background
            (w, h), _ = cv2.getTextSize(
                f"{conf:.2f}", cv2.FONT_HERSHEY_SIMPLEX, 1.7, 1)
            # Filled background, then the confidence text on top of it
            cv2.rectangle(image, (xmin, ymin - h - 5),
                          (xmin + w + 5, ymin), (255, 255, 255), -1)
            cv2.putText(image, f"{conf:.2f}", (xmin, ymin - h),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.7, (0, 0, 0), 3)
    return image
CodePudding user response:
As stated in the description:

In case of person detection, it is equal to 1 for each detected box with person in it and 0 for the background.

you need to take both outputs into account, because output 2 may contain a background class for some of the boxes. You can reuse most of your existing code:
boxes, labels = output1, output2  # the two blobs from the model
for detection, label in zip(boxes, labels):
    xmin, ymin, xmax, ymax, confidence = detection
    if confidence > 0.2 and label == 1:
        # reuse the drawing code from the example
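Putting it together, here is a minimal sketch of a two-output post-process. It assumes the boxes blob has shape (N, 5) with normalized [x_min, y_min, x_max, y_max, conf] rows (as your original code assumed; some zoo models emit pixel coordinates instead, in which case drop the scaling) and that the labels blob has shape (N,) with 1 = person. The function names here (`filter_person_boxes`, `draw_person_boxes`) are illustrative, not from the OpenVINO API:

```python
import numpy as np

def filter_person_boxes(boxes, labels, image_shape, conf_threshold=0.2):
    """Keep only detections labelled as person (label == 1) above the
    confidence threshold, scaled to pixel coordinates.

    boxes:  (N, 5) array of [x_min, y_min, x_max, y_max, conf],
            assumed normalized to [0, 1] as in the original code
    labels: (N,)   array of class ids (1 = person, 0 = background)
    """
    h, w = image_shape[:2]
    keep = (labels == 1) & (boxes[:, 4] > conf_threshold)
    scaled = []
    for xmin, ymin, xmax, ymax, conf in boxes[keep]:
        scaled.append((int(xmin * w), int(ymin * h),
                       int(xmax * w), int(ymax * h), float(conf)))
    return scaled

def draw_person_boxes(image, boxes, labels, conf_threshold=0.2):
    # cv2 imported here so the pure-NumPy filter above stays usable
    # (and testable) without OpenCV installed
    import cv2
    for xmin, ymin, xmax, ymax, conf in filter_person_boxes(
            boxes, labels, image.shape, conf_threshold):
        cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (255, 255, 255), 5)
        cv2.putText(image, f"{conf:.2f}", (xmin, max(ymin - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.7, (0, 0, 0), 3)
    return image
```

You would call it as `draw_person_boxes(frame, output1, output2)` after inference. Splitting the label/confidence filtering from the drawing also makes it easy to check the filtering logic on its own.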