How to loop through computer vision output array

Each detection in a computer vision model's output blob has the form [image_id, label, conf, x_min, y_min, x_max, y_max], where:

image_id - ID of the image in the batch
label - predicted class ID (0 - person)
conf - confidence for the predicted class
(x_min, y_min) - coordinates of the top left bounding box corner
(x_max, y_max) - coordinates of the bottom right bounding box corner

To inspect the array in the code I am working with, I simply print it with print(result_blob.tolist())

The output looks like this:

[0.0, 0.0, 0.6192154288291931, 0.36988523602485657, 0.39582735300064087, 0.7380738258361816, 0.9911962151527405]

If I wanted to draw a bounding box with OpenCV around a person in the video feed when the confidence is greater than 0.5, how do I loop through the elements of a NumPy array to do so?

The repository also contains a function that does this, but I don't understand how it works or whether there is a better way to write it. Can someone help me understand what detections = result.reshape(-1, 7) does, and the Python enumerate below it? I get the cv2 rectangle drawing, but I am confused about why the function converts floats to ints, etc.

def postprocess(result, image, fps=None):
    """
    Define the postprocess function for output data
    
    :param: result: the inference results
            image: the original input frame
            fps: average throughput calculated for each frame
    :returns:
            image: the image with bounding box and fps message
    """
    detections = result.reshape(-1, 7)
    for i, detection in enumerate(detections):
        _, image_id, confidence, xmin, ymin, xmax, ymax = detection
        if confidence > 0.5:
            xmin = int(max((xmin * image.shape[1]), 10))
            ymin = int(max((ymin * image.shape[0]), 10))
            xmax = int(min((xmax * image.shape[1]), image.shape[1] - 10))
            ymax = int(min((ymax * image.shape[0]), image.shape[0] - 10))
            cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            cv2.putText(image, str(round(fps, 2)) + " fps", (5, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 3)
    return image

CodePudding user response：

The function expects result to be a NumPy array. For n detections, the rest of the code needs an array of shape (n, 7): one row per detection, with 7 values in each row. result.reshape(-1, 7) produces that shape; the -1 tells NumPy to infer the number of rows from the total number of elements. Thanks to the reshape, the function is lenient about the shape it receives: inputs of shape (n, 7), (n * 7,) or (1, n, 7) would all work.
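As a minimal sketch of that leniency, the snippet below reshapes two differently shaped inputs to the same (n, 7) layout. The detection values and the 4-D input shape are made up for illustration and are not taken from the repository.

```python
import numpy as np

# One detection in the 7-value layout from the question (values are illustrative).
row = [0.0, 0.0, 0.62, 0.37, 0.40, 0.74, 0.99]

flat = np.array(row * 3)            # shape (21,): three detections, flattened
nested = flat.reshape(1, 1, 3, 7)   # a 4-D blob; this shape is an assumption

for blob in (flat, nested):
    # -1 lets NumPy infer the number of rows from the total element count.
    detections = blob.reshape(-1, 7)
    print(blob.shape, "->", detections.shape)
```

Both inputs end up with shape (3, 7), so the loop that follows can treat every row as one detection.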

Making sure the shape is (n, 7) allows an easy loop over the detections, because a loop over the rows of a 2D NumPy array can in general be written as:

for row in array:
    ...

Wrapping this in enumerate just adds the index of each row. That index isn't used here, so the enumerate is unnecessary. The loop was probably copied from another postprocessing function and the enumerate was left in by accident.
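To make the difference concrete, here is a small sketch that iterates over the same array with and without enumerate. The two detection rows are invented for illustration.

```python
import numpy as np

# Two illustrative detections, already in (n, 7) shape.
detections = np.array([
    [0.0, 0.0, 0.62, 0.37, 0.40, 0.74, 0.99],
    [0.0, 0.0, 0.31, 0.10, 0.20, 0.30, 0.40],
])

# A plain loop yields one row (a 1-D array of 7 values) per iteration.
rows_plain = [row for row in detections]

# enumerate yields the same rows, paired with their index.
rows_indexed = [(i, row) for i, row in enumerate(detections)]

for i, row in rows_indexed:
    print(i, row[2])  # row index and confidence
```

The rows are identical in both cases; enumerate is only worth keeping if the index is actually needed, for example to label each box.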

The unpacking names the second element image_id, although according to the layout in the question image_id is the first element and the second is label, so this is probably another small error. Since neither value is used, it may be nicer to write:

confidence, xmin, ymin, xmax, ymax = detection[2:]

The four lines that convert xmin, ymin, xmax and ymax do two things:

Convert from relative coordinates (fractions of the image width, image.shape[1], and height, image.shape[0]) to pixel coordinates, which have to be integers
Make sure the rectangle isn't drawn within 10 pixels of the edge of the image
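The two steps above can be sketched as a small helper that uses only NumPy. The name boxes_to_draw and its threshold and margin parameters are hypothetical, and the 640x480 frame size is an assumption; the detection is the one printed in the question.

```python
import numpy as np

def boxes_to_draw(result, width, height, threshold=0.5, margin=10):
    """Return integer pixel boxes (xmin, ymin, xmax, ymax) for confident detections."""
    boxes = []
    for detection in np.asarray(result).reshape(-1, 7):
        # Skip image_id and label; only the last five values are needed.
        confidence, xmin, ymin, xmax, ymax = detection[2:]
        if confidence > threshold:
            boxes.append((
                int(max(xmin * width, margin)),            # scale, keep off the left edge
                int(max(ymin * height, margin)),           # scale, keep off the top edge
                int(min(xmax * width, width - margin)),    # scale, keep off the right edge
                int(min(ymax * height, height - margin)),  # scale, keep off the bottom edge
            ))
    return boxes

# The detection printed in the question, on an assumed 640x480 frame.
result = [0.0, 0.0, 0.6192154288291931, 0.36988523602485657,
          0.39582735300064087, 0.7380738258361816, 0.9911962151527405]
boxes = boxes_to_draw(result, width=640, height=480)
print(boxes)  # [(236, 189, 472, 470)]
```

Each tuple can then be passed to cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2) as in the original function, with width = image.shape[1] and height = image.shape[0]. Note that ymax is clamped to 470 here because the raw value falls inside the 10-pixel margin.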