NumPy filling values inside given bounding box coordinates for a large array

I have a very large 3D array:

import numpy as np

large = np.zeros((2000, 1500, 700))

Actually, large is an image, but each pixel holds 700 values. I also have 400 bounding boxes, and they do not have a fixed shape. For each box, I store a tuple of lower and upper bound coordinates as follows:

boxes_y = [(y_lower0, y_upper0), (y_lower1, y_upper1), ..., (y_lower399, y_upper399)]
boxes_x = [(x_lower0, x_upper0), (x_lower1, x_upper1), ..., (x_lower399, x_upper399)]

Then, for each box, I want to fill the corresponding region of the large array with a vector of size 700. Specifically, I have an embedding vector for each box:

embeddings = np.random.rand(400, 700)  # In the real case, these are not random; just consider the shape

What I want to do is:

for i in range(400):
    large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

This works, but it is too slow for such a large array. I am looking for a way to vectorize this computation.

CodePudding user response:

One big problem is that the input is really huge (~15.6 GiB for float64 items). Another is that it is traversed up to 400 times in the worst case, resulting in up to 6240 GiB written to RAM, because overlapping regions are written multiple times.
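
These figures follow directly from the array shape and can be checked quickly:

# 2000 * 1500 * 700 float64 values, 8 bytes each
2000 * 1500 * 700 * 8 / 2**30   # ~15.6 GiB for the whole array
400 * 15.6                      # up to ~6240 GiB written if all 400 boxes cover the image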

A better solution is to iterate over the first two dimensions (those of the "image") to find which bounding box must be copied into each pixel, as proposed by @dankal444. This is similar to what Z-buffer-based algorithms do in computer graphics.
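
To picture the idea, here is a minimal pure-NumPy sketch (not the final approach below; it assumes later boxes overwrite earlier ones, matching the original loop):

ids = np.full(large.shape[:2], -1, dtype=np.int16)  # -1 means "no box"
for i in range(len(boxes_y)):
    ids[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = i  # cheap 2D int16 writes
mask = ids >= 0
large[mask] = embeddings[ids[mask]]  # each pixel of large is written at most once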

Based on this, an even better solution is to use a scanline rendering algorithm. In your case, the algorithm is much simpler than the traditional one, since you are working with bounding boxes rather than complex polygons. For each scanline (2000 of them here), you can quickly filter the bounding boxes that intersect the scanline and then iterate over them. The classical algorithm is a bit too complex for your simple case: for each scanline, iterating over the filtered bounding boxes and writing their indices into the covered pixels is enough. This operation can be done in parallel with Numba. It is very fast because the computation is mainly performed in the CPU cache.

The final operation is to perform the actual data writes based on the previously computed indices (still with Numba, in parallel). This operation is still memory-bound, but the output array is written only once (only 15.6 GiB of RAM in the worst case, and 7.8 GiB for float32 items). This should take a fraction of a second on most machines. If this is not enough, you could try a dedicated GPU, since GPU RAM is often significantly faster than main RAM (typically about an order of magnitude faster).
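
For instance, here is a hypothetical CuPy variant of the final gather step (an assumption: it requires CuPy, enough GPU memory, and the 2D id map ids from the sketch above):

import cupy as cp

ids_gpu = cp.asarray(ids)                  # 2D box-id map
emb_gpu = cp.asarray(embeddings)
large_gpu = cp.zeros(ids.shape + (emb_gpu.shape[1],), dtype=emb_gpu.dtype)
mask = ids_gpu >= 0
large_gpu[mask] = emb_gpu[ids_gpu[mask].astype(cp.int64)]
result = cp.asnumpy(large_gpu)             # copy back to the host only if needed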

Here is the implementation:

import numpy as np
import numba as nb

# Assume the last dimension of `large` and `embeddings` is contiguous in memory
@nb.njit('void(float32[:,:,::1], float32[:,::1], int_[:,::1], int_[:,::1])', parallel=True)
def fastFill(large, embeddings, boxes_y, boxes_x):
    n, m, l = large.shape
    boxCount = embeddings.shape[0]
    assert embeddings.shape == (boxCount, l)
    assert boxes_y.shape == (boxCount, 2)
    assert boxes_x.shape == (boxCount, 2)
    imageBoxIds = np.full((n, m), -1, dtype=np.int16)
    for y in nb.prange(n):
        # Filtering -- a sort is not required since the number of bounding boxes is small
        boxIds = np.where((boxes_y[:,0] <= y) & (y < boxes_y[:,1]))[0]
        for k in boxIds:
            lower, upper = boxes_x[k]
            imageBoxIds[y, lower:upper] = k
    # Actual filling
    for y in nb.prange(n):
        for x in range(m):
            boxId = imageBoxIds[y, x]
            if boxId >= 0:
                large[y, x, :] = embeddings[boxId]
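
Note that if the boxes are stored as Python lists of tuples like in the question, they must first be converted to contiguous integer arrays matching the int_[:,::1] signature (using np.int_ here is my assumption; it is the platform's default integer type):

boxes_y = np.asarray(boxes_y, dtype=np.int_)
boxes_x = np.asarray(boxes_x, dtype=np.int_)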

Here is the benchmark:

large = np.zeros((1000, 750, 700), dtype=np.float32)  # 8 times smaller in memory
boxes_y = np.cumsum(np.random.randint(0, large.shape[0]//2, size=(400, 2)), axis=1)
boxes_x = np.cumsum(np.random.randint(0, large.shape[1]//2, size=(400, 2)), axis=1)
embeddings = np.random.rand(400, 700).astype(np.float32)

# Called many times
for i in range(400):
    large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

# Called many times
fastFill(large, embeddings, boxes_y, boxes_x)
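
The exact timing harness is not shown; a minimal one using time.perf_counter could look like this:

import time

start = time.perf_counter()
fastFill(large, embeddings, boxes_y, boxes_x)
print(f"fastFill: {time.perf_counter() - start:.2f} s")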

Here are results on my machine:

Initial code:        2.71 s
Numba (sequential):  0.13 s
Numba (parallel):    0.12 s   (~22x faster than the initial code)

Note that the first run is slower because of virtual zero-mapped memory: the pages of the zero-initialized output are only physically allocated when first written. The Numba version is still about 10 times faster in that case.
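
To exclude this first-touch cost from a benchmark, one option is to force the allocation once before timing (a benchmarking detail, not part of the algorithm):

large.fill(0)  # writes every element, so the OS physically allocates all pages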
