My problem is that I want my program to announce only a new object when it appears in front of the camera, instead of repeating the names of the objects it already sees. For example, it should say 'person' just once when it sees a person, and when something new enters the frame, like a cup, it should say 'cup' just once. I have been told that to do this I should keep the object from the last frame in a variable so that only new objects get spoken, but it is not clear to me how to do that in code. I am using OpenCV, YOLOv3 and pyttsx3.
import cv2
import numpy as np
import pyttsx3

net = cv2.dnn.readNet('yolov3-tiny.weights', 'yolov3-tiny.cfg')

classes = []
with open("coco.names.txt", "r") as f:
    classes = f.read().splitlines()

cap = cv2.VideoCapture(0)
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(100, 3))

while True:
    _, img = cap.read()
    height, width, _ = img.shape

    blob = cv2.dnn.blobFromImage(img, 1/255, (416, 416), (0, 0, 0), swapRB=True, crop=False)
    net.setInput(blob)
    output_layers_names = net.getUnconnectedOutLayersNames()
    layerOutputs = net.forward(output_layers_names)

    boxes = []
    confidences = []
    class_ids = []
    for output in layerOutputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.2:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.2, 0.4)
    if len(indexes) > 0:
        for i in indexes.flatten():
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = str(round(confidences[i], 2))
            color = colors[i]
            cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
            cv2.putText(img, label + " " + confidence, (x, y + 20), font, 2, (255, 255, 255), 2)
            # this speaks every detected label on every frame, hence the repetition
            engine = pyttsx3.init()
            engine.say(label)
            engine.runAndWait()

    cv2.imshow('Image', img)
    key = cv2.waitKey(1)
    if key == 27:
        break

cap.release()
cv2.destroyAllWindows()
CodePudding user response:
There are whole PhD dissertations about this, and not just a few. It is almost a whole research domain by itself.
You need to track objects, and YOLO works on single images, not sequences of images. From YOLO's point of view, there is no relation at all between what happens in one frame and what happens in another. The frames could be random, unrelated images grabbed from a photo collection and it would be the same.
So you need a second stage after YOLO, considerably more complicated, to track objects. And since YOLO is not a sure thing (nor is any other CNN detector), that second stage cannot be a naive update of a dictionary of detected objects. Some objects will be detected in only half of the frames, and you need to track them anyway. Others (for example, two pedestrians walking side by side) could be confused, because pedestrian B at frame t+1 is at the same position pedestrian A was at frame t. And yet you want to track them both.
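To make concrete what that naive approach looks like (it is essentially what the asker was advised to do), here is a minimal sketch of a per-frame label memory: remember which class names were spoken in the previous frame and only speak the ones that are new. The function and variable names (announce_new_labels, previous_labels) are purely illustrative, and it fails in exactly the ways described above whenever a detection flickers in and out.
import pyttsx3

engine = pyttsx3.init()          # initialise the TTS engine once, outside the loop
previous_labels = set()          # class names that were present in the previous frame

def announce_new_labels(detected_labels):
    # detected_labels: class names detected in the current frame
    global previous_labels
    current_labels = set(detected_labels)
    for label in current_labels - previous_labels:
        engine.say(label)        # spoken only when the label was absent last frame
    engine.runAndWait()
    previous_labels = current_labels

# Example: the second call speaks only 'cup', because 'person' was already there,
# but a single missed detection of the person would make it say 'person' again later.
announce_new_labels(['person'])
announce_new_labels(['person', 'cup'])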
I can't just give you the full code in an answer; it is far bigger than what fits in a Q&A like this.
But I can point you to one algorithm, often used in conjunction with YOLO, that tracks objects: SORT, one implementation of which is this one.
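For the "announce each object once" goal, the useful property of SORT is that it assigns a persistent track ID to each object, so you can speak a label the first time a given ID shows up. The sketch below assumes the widely used single-file sort.py implementation (a Sort class whose update() takes an N x 5 array of [x1, y1, x2, y2, score] rows and returns [x1, y1, x2, y2, track_id] rows); the module name, the update() signature and the helper update_and_announce are assumptions to adapt to whatever implementation you actually install.
import numpy as np
from sort import Sort            # assumed module/class name of the SORT implementation

tracker = Sort()                 # keeps object identities across frames
spoken_track_ids = set()         # track IDs that have already been announced

def update_and_announce(boxes, confidences, class_ids, engine, classes):
    # boxes are the [x, y, w, h] lists built in the question's loop
    if len(boxes) == 0:
        dets = np.empty((0, 5))
    else:
        dets = np.array([[x, y, x + w, y + h, c]
                         for (x, y, w, h), c in zip(boxes, confidences)])
    tracks = tracker.update(dets)            # rows of [x1, y1, x2, y2, track_id]
    for x1, y1, x2, y2, track_id in tracks:
        track_id = int(track_id)
        if track_id in spoken_track_ids or len(boxes) == 0:
            continue                         # already announced (or nothing to match)
        spoken_track_ids.add(track_id)
        # SORT itself carries no class labels, so match the track box back to the
        # nearest detection box to recover one (a crude heuristic).
        distances = [((bx + bw / 2) - (x1 + x2) / 2) ** 2 +
                     ((by + bh / 2) - (y1 + y2) / 2) ** 2
                     for bx, by, bw, bh in boxes]
        label = classes[class_ids[int(np.argmin(distances))]]
        engine.say(label)                    # spoken once per tracked object
    engine.runAndWait()
Inside the while loop this would replace the per-detection engine.say(label): after NMS you would call update_and_announce once per frame, ideally passing only the boxes, confidences and class_ids kept by the indexes from NMSBoxes. The caveats above still apply; SORT mitigates the flicker and identity problems, it does not make them trivial.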
You can easily find many descriptions of how to combine YOLO and SORT, like this one (a very recent one; YOLO + SORT was already a classic well before 2020, so I don't know what is new in this specific 2022 article. Maybe nothing; not all articles are novel, even if they should be. I just picked one at random).
Or, since nowadays people often prefer videos to written documentation, you can watch this tutorial.
But however you go about learning this, you'll see that it is a bigger task than just asking on SO.