For a project I implemented a simple background subtraction using a median background estimation. The result is not bad, but moving objects (people in my test examples) are often cut into unconnected blobs.
I tried applying open and close operations on the foreground mask to improve the result (I later removed the close operation because it did not seem to help), which worked to some degree. However, I am wondering whether there are further ways to improve the foreground mask; it is still a fair bit away from the ground truth.
I am aware that tuning the threshold itself is always an option, and I do experiment with it, but my focus is on reducing noise to a minimum. I also tried adaptive thresholding, but that did not look very promising for this use case.
Without opening: [image of the resulting foreground mask omitted]
I am looking more for general approaches than for actual implementations.
Pseudocode of the background subtraction:
1. Greyscale all images.
2. Make a background estimation by calculating the median of every r, g, and b value for every pixel over a subset of all images.
3. Take every image and calculate the absolute difference between that image and the background estimation.
4. Apply a threshold to get a binary result, called the foreground mask.
5. Apply OpenCV's open operation once.
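For reference, a minimal Python/OpenCV sketch of those steps, assuming the frames are already loaded as a list of BGR images (function names, the subset size, and the threshold value are illustrative, not part of the original post):

```python
import cv2
import numpy as np

def median_background_subtraction(frames, sample_size=50, thresh=30):
    # 1. Greyscale all frames.
    grey = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]

    # 2. Background estimate: per-pixel median over a subset of frames.
    background = np.median(np.stack(grey[:sample_size]), axis=0).astype(np.uint8)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    masks = []
    for g in grey:
        # 3. Absolute difference to the background estimate.
        diff = cv2.absdiff(g, background)
        # 4. Fixed threshold -> binary foreground mask.
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        # 5. One morphological opening to suppress small noise blobs.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        masks.append(mask)
    return background, masks
```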
CodePudding user response:
I like the greyscale simplification. Simple is good. We should make everything as simple as possible, but not simpler.
Let's attack your model for a moment. An evil clothing designer with an army of fashion models sends them walking past your camera, each wearing a red shirt that is slightly darker than the preceding one. At least one of the models will be "invisible" against some of your background pixels, having worn a matching shade, with matching illumination, compared with the median pixel value. Repeat with a group of green shirts, then blue.
How to remedy this? For each pixel, compute the median red, median green, and median blue intensity separately. At inference time, compute the three absolute differences against those medians and threshold on the maximum of the deltas.
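A hedged sketch of that per-channel variant, assuming `frames` is a list of BGR uint8 images (names and the threshold are illustrative):

```python
import numpy as np

def colour_median_background(frames):
    # Median independently for B, G and R at every pixel.
    return np.median(np.stack(frames).astype(np.float32), axis=0)

def colour_foreground_mask(frame, background, thresh=30):
    # Three absolute differences, one per channel, then take the maximum.
    delta = np.abs(frame.astype(np.float32) - background)
    max_delta = delta.max(axis=2)
    return (max_delta > thresh).astype(np.uint8) * 255
```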
Computing over sensor R, G, B is straightforward. Human perception aligns more closely with H, S, V. Consider computing the max delta over those three channels, or over all six.
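One possible sketch of the HSV side of that idea, where `background_hsv` is assumed to be the per-pixel median of the HSV-converted frames, computed analogously to the BGR case above (and hue wraps around, so its difference needs special care):

```python
import cv2
import numpy as np

def hsv_delta(frame, background_hsv):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
    delta = np.abs(hsv - background_hsv)
    # OpenCV hue lives in [0, 180); take the shorter way around the circle.
    delta[..., 0] = np.minimum(delta[..., 0], 180.0 - delta[..., 0])
    return delta.max(axis=2)  # max over H, S, V per pixel
```

Thresholding on the maximum over all six deltas would then just combine this with the BGR deltas from the previous sketch.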
For each background pixel, compute both expected value and variance, either for the whole video or for one-minute slots of time. Now, at inference time, the variance informs your thresholding decision, improving its accuracy.
For example, some pixels may have constant illumination, others slowly change with the movement of the sun, and others are quite noisy when wind disturbs the leaves of vegetation. The variance lets you capture this quite naturally.
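As a sketch of that variance-aware threshold, assuming greyscale frames and an illustrative deviation factor `k` (and a `min_std` floor so near-constant pixels do not produce a zero threshold):

```python
import numpy as np

def fit_background_stats(grey_frames):
    # Per-pixel mean and standard deviation over the background frames.
    stack = np.stack(grey_frames).astype(np.float32)
    return stack.mean(axis=0), stack.std(axis=0)

def variance_aware_mask(grey_frame, mean, std, k=3.0, min_std=2.0):
    # Flag pixels that deviate by more than k standard deviations.
    deviation = np.abs(grey_frame.astype(np.float32) - mean)
    return (deviation > k * np.maximum(std, min_std)).astype(np.uint8) * 255
```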
For a much more powerful modeling approach, go with Optical Flow. https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html
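A rough sketch of using the dense Farnebäck flow from that tutorial as a motion cue; thresholding the flow magnitude gives a mask of moving pixels (the magnitude threshold is an assumption you would need to tune):

```python
import cv2
import numpy as np

def flow_motion_mask(prev_grey, next_grey, mag_thresh=1.0):
    # Dense optical flow between two consecutive greyscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_grey, next_grey, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Per-pixel flow magnitude; large magnitude = moving pixel.
    magnitude = np.linalg.norm(flow, axis=2)
    return (magnitude > mag_thresh).astype(np.uint8) * 255
```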