I am working with the OpenVINO Model Zoo, which has lots of pre-trained models. How does one post-process the data after inference when the model has two outputs, like this model? I am only looking to draw bounding boxes around people in the video or picture frame.
Output 1:
The boxes is a blob with the shape 100, 5 in the format N, 5, where N is the number of detected bounding boxes. For each detection, the description has the format: [x_min, y_min, x_max, y_max, conf]
Output 2:
The labels is a blob with the shape 100 in the format N, where N is the number of detected bounding boxes. In case of person detection, it is equal to 1 for each detected box with person in it and 0 for the background.
I am confused about whether I need to use one of the model outputs, or both, to come up with the bounding-box coordinates of the people detected.
Below is a function from the OpenVINO notebooks for post-processing people-detection data, but I know it doesn't accommodate a model that has two outputs. Any tips and/or pseudocode greatly appreciated.
import cv2

def postprocess(result, image):
    """
    Post-process the inference output and draw bounding boxes.
    :param result: the inference results (the boxes blob, shape N x 5)
    :param image: the original input frame
    :returns: the image with bounding boxes and confidence labels drawn on it
    """
    detections = result.reshape(-1, 5)
    for detection in detections:
        xmin, ymin, xmax, ymax, confidence = detection
        if confidence > 0.2:
            # Scale to pixel coordinates, keeping a 10-px margin from the edges
            xmin = int(max(xmin * image.shape[1], 10))
            ymin = int(max(ymin * image.shape[0], 10))
            xmax = int(min(xmax * image.shape[1], image.shape[1] - 10))
            ymax = int(min(ymax * image.shape[0], image.shape[0] - 10))
            conf = round(confidence, 2)
            print(f"conf: {conf:.2f}")
            print((xmin, ymin), (xmax, ymax))
            # Bounding box
            cv2.rectangle(image, (xmin, ymin),
                          (xmax, ymax), (255, 255, 255), 5)
            # Find the space required for the text background
            (w, h), _ = cv2.getTextSize(
                f"{conf:.2f}", cv2.FONT_HERSHEY_SIMPLEX, 1.7, 1)
            # Filled background, then the confidence text on top of it
            cv2.rectangle(image, (xmin, ymin - h - 5),
                          (xmin + w + 5, ymin), (255, 255, 255), -1)
            cv2.putText(image, f"{conf:.2f}", (xmin, ymin - h),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.7, (0, 0, 0), 3)
    return image
CodePudding user response:
As stated in the description:

In case of person detection, it is equal to 1 for each detected box with person in it and 0 for the background.

you need to take both outputs into account, because output 2 may contain a background class for some of the boxes. You can reuse most of your existing code:
boxes, labels = output1, output2  # the two blobs from the model
for detection, label in zip(boxes, labels):
    xmin, ymin, xmax, ymax, confidence = detection
    if confidence > 0.2 and label == 1:
        # reuse the drawing code from the example
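Putting it together, here is a minimal sketch of a two-output post-process. It assumes the boxes blob has shape (N, 5) with normalized [x_min, y_min, x_max, y_max, conf] rows (as your original code assumed; some zoo models emit pixel coordinates instead, in which case drop the scaling) and that the labels blob has shape (N,) with 1 = person. The function names here (`filter_person_boxes`, `draw_person_boxes`) are illustrative, not from the OpenVINO API:

```python
import numpy as np

def filter_person_boxes(boxes, labels, image_shape, conf_threshold=0.2):
    """Keep only detections labelled as person (label == 1) above the
    confidence threshold, scaled to pixel coordinates.

    boxes:  (N, 5) array of [x_min, y_min, x_max, y_max, conf],
            assumed normalized to [0, 1] as in the original code
    labels: (N,)   array of class ids (1 = person, 0 = background)
    """
    h, w = image_shape[:2]
    keep = (labels == 1) & (boxes[:, 4] > conf_threshold)
    scaled = []
    for xmin, ymin, xmax, ymax, conf in boxes[keep]:
        scaled.append((int(xmin * w), int(ymin * h),
                       int(xmax * w), int(ymax * h), float(conf)))
    return scaled

def draw_person_boxes(image, boxes, labels, conf_threshold=0.2):
    # cv2 imported here so the pure-NumPy filter above stays usable
    # (and testable) without OpenCV installed
    import cv2
    for xmin, ymin, xmax, ymax, conf in filter_person_boxes(
            boxes, labels, image.shape, conf_threshold):
        cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (255, 255, 255), 5)
        cv2.putText(image, f"{conf:.2f}", (xmin, max(ymin - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.7, (0, 0, 0), 3)
    return image
```

You would call it as `draw_person_boxes(frame, output1, output2)` after inference. Splitting the label/confidence filtering from the drawing also makes it easy to check the filtering logic on its own.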