I am rotating images and pasting them onto a screen-sized (4K) canvas using the code below, but this implementation takes more than a hundred milliseconds per image rotated and pasted. The program I am using this in needs to do this a lot, so speeding it up would be beneficial, and since this is a pretty standard kind of operation I presume the code is very optimizable. I would be grateful for any guidance on how to optimize it.
It may be relevant to mention that the various rotated images are usually quite close together and sometimes overlap, which is why I am doing the masking, but that is one of the places where I suspect I am being inefficient.
import cv2
import numpy as np
canvas = np.zeros((2160, 3840, 3), dtype=np.uint8)
img_path = PATH_TO_IMAGE
image = cv2.imread(img_path)
offset_from_center = 10
rotation_angle = 45
width = image.shape[1]
pivot_point = (width/2, offset_from_center)
rotation_mat = cv2.getRotationMatrix2D(pivot_point, -rotation_angle, 1.)
canvas_height = canvas.shape[0]
canvas_width = canvas.shape[1]
rotation_mat[0, 2] = canvas_width/2 - pivot_point[0]
rotation_mat[1, 2] = canvas_height/2 - pivot_point[1]
rotated_image = cv2.warpAffine(image, rotation_mat, (canvas_width, canvas_height))
alpha = np.sum(rotated_image, axis=-1) > 0
alpha = alpha.astype(float)
alpha = np.dstack((alpha, alpha, alpha))
rotated_image = rotated_image.astype(float)
canvas = canvas.astype(float)
foreground = cv2.multiply(alpha, rotated_image)
canvas = cv2.multiply(1.0 - alpha, canvas)
canvas = cv2.add(foreground, canvas)
canvas = canvas.astype(np.uint8)
CodePudding user response:
A quick profile shows that np.sum(rotated_image, axis=-1) is particularly slow. The operations that follow are a bit slow too, especially the array multiplications and the dstack.
Numpy-based optimization
The first thing to know is that np.sum will automatically convert the array to a wider type before doing the reduction, in practice a 64-bit integer. This means an array 8 times bigger to reduce in memory, hence a significant slowdown. That being said, the biggest issue is that Numpy is not optimized for reducing over a very small dimension: it iterates over each array line using a quite inefficient approach, and the iteration is far more expensive than adding 3 integers. One solution is to do a manual addition, which speeds this up significantly. Here is an example:
tmp = rotated_image.astype(np.uint16)
alpha = (tmp[:,:,0] + tmp[:,:,1] + tmp[:,:,2]) > 0
Note that this is sub-optimal and using Cython or Numba can speed this up even more (by a large margin). Another alternative is to operate on a transposed image.
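The automatic type widening described above can be checked directly. A small sketch using only NumPy (the exact widened dtype is platform-dependent, so only its size is shown):

```python
import numpy as np

# A tiny uint8 "image": np.sum widens the accumulator type automatically,
# so the reduced array uses a much wider element than the 1-byte input.
img = np.zeros((4, 4, 3), dtype=np.uint8)
summed = np.sum(img, axis=-1)
print(img.dtype.itemsize, summed.dtype.itemsize)  # 1 vs. a platform integer size

# The manual uint16 addition keeps the intermediate small:
tmp = img.astype(np.uint16)
alpha = (tmp[:, :, 0] + tmp[:, :, 1] + tmp[:, :, 2]) > 0
print(alpha.dtype)  # bool
```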
Then, when you do alpha.astype(float), 64-bit floats are used, which takes a lot of memory, so any manipulation is slow. 64-bit floats are clearly not required here; 32-bit floats can be used instead. In fact, PC GPUs nearly always use 32-bit floats to compute images, since 64-bit operations are much more expensive (they require more energy and a lot more transistors).
np.dstack is not needed since Numpy can make use of broadcasting (with alpha[:,:,None]) so as to avoid big temporary arrays like this. Unfortunately, Numpy broadcasting slows things down in practice...
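For illustration, the broadcasting form produces the same result as the dstack version. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3), dtype=np.float32)
alpha = (rng.random((8, 8)) > 0.5).astype(np.float32)

# dstack materializes a full (8, 8, 3) temporary mask...
stacked = np.dstack((alpha, alpha, alpha)) * img
# ...while broadcasting a trailing axis avoids building it explicitly.
broadcast = alpha[:, :, None] * img

print(np.allclose(stacked, broadcast))  # True
```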
cv2.multiply can be replaced by np.multiply, which has a special parameter out for faster in-place operations (30% faster on my machine). cv2.multiply also has an equivalent dst parameter, as pointed out by @ChristophRackwitz. Both functions run equally fast on my machine.
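As a sketch of the out= form: it writes the result into an existing buffer instead of allocating a new array on every call, and writing back into one of the operands is allowed.

```python
import numpy as np

a = np.full((4, 4), 2.0, dtype=np.float32)
b = np.full((4, 4), 3.0, dtype=np.float32)

# Without out=, np.multiply allocates a fresh result array.
c = np.multiply(a, b)

# With out=, the product is written in place into an existing buffer,
# here one of the operands itself.
np.multiply(a, b, out=b)
print(c[0, 0], b[0, 0])  # both 6.0
```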
Here is a faster implementation summing up all of this:
tmp = rotated_image.astype(np.uint16)
alpha = (tmp[:,:,0] + tmp[:,:,1] + tmp[:,:,2]) > 0
alpha = alpha.astype(np.float32)
alpha = np.dstack((alpha, alpha, alpha))
rotated_image = rotated_image.astype(np.float32)
canvas = canvas.astype(np.float32)
foreground = np.multiply(alpha, rotated_image)
np.subtract(1.0, alpha, out=alpha)
np.multiply(alpha, canvas, out=canvas)
np.add(foreground, canvas, out=canvas)
canvas = canvas.astype(np.uint8)
Thus, the optimized solution is 4 times faster. It is still far from fast, but this pushes Numpy close to its limits.
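Since the alpha mask here is binary (a pixel is either fully copied or left alone), a further simplification, not used in the code above, is to skip the float blend entirely and copy the non-black pixels with a boolean mask. A minimal sketch with toy arrays standing in for the real canvas and rotated image:

```python
import numpy as np

canvas = np.zeros((6, 6, 3), dtype=np.uint8)
rotated_image = np.zeros((6, 6, 3), dtype=np.uint8)
rotated_image[2:4, 2:4] = (10, 20, 30)  # some non-black content

# True where the rotated image has any non-zero channel.
mask = rotated_image.any(axis=-1)
# Overwrite only those pixels; everything stays uint8, no float temporaries.
canvas[mask] = rotated_image[mask]

print(canvas[3, 3], canvas[0, 0])  # [10 20 30] [0 0 0]
```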
Speeding the code with Numba
Creating many big temporary arrays and operating on small dimensions is far from efficient. This can be solved by computing pixels on the fly in a basic loop optimized by a JIT compiler like Numba (or by Cython) so as to produce fast native code. Here is an implementation:
import numba as nb
@nb.njit('(uint8[:,:,::1], uint8[:,:,::1])', parallel=True)
def compute(img, canvas):
    for i in nb.prange(img.shape[0]):
        for j in range(img.shape[1]):
            ir = np.float32(img[i, j, 0])
            ig = np.float32(img[i, j, 1])
            ib = np.float32(img[i, j, 2])
            cr = np.float32(canvas[i, j, 0])
            cg = np.float32(canvas[i, j, 1])
            cb = np.float32(canvas[i, j, 2])
            alpha = np.float32((ir + ig + ib) > 0)
            inv_alpha = np.float32(1.0) - alpha
            cr = inv_alpha * cr + alpha * ir
            cg = inv_alpha * cg + alpha * ig
            cb = inv_alpha * cb + alpha * ib
            canvas[i, j, 0] = np.uint8(cr)
            canvas[i, j, 1] = np.uint8(cg)
            canvas[i, j, 2] = np.uint8(cb)

compute(rotated_image, canvas)
Here are performance results on my 6-core machine (repeated 7 times):
Before: 0.427 s 1x
Optimized Numpy: 0.113 s ~4x
Numba: 0.0023 s ~186x
As we can see, the Numba implementation is much faster than the optimized Numpy implementation. If this is not enough, you can compute this on the GPU, which is more efficient for such a task. It should be much faster and use less energy, because GPUs have dedicated units for alpha-blending (and wider SIMD units). One solution is to use CUDA, which only works on Nvidia GPUs. Another solution is to use OpenCL, or even OpenGL. These should be more complex to use than Numba, though.