I have an array of 1000-2000 lists, and I need to pass each of those lists into a function. Each function call takes approximately 0.002 s. I have already tried concurrent.futures.ProcessPoolExecutor and multiprocessing.Pool, and each of those approaches is 1.5x-2x slower than simply iterating over the big array in a linear, single-process way. I have also tried splitting the big array of lists into several smaller arrays of lists, but that didn't get the job done faster either, only slower. I assume the reason is the overhead of switching between processes, but theoretically, splitting the big array into chunks should have fixed that.
Is there any way to do this faster?
The lists are lines of an image (numpy arrays) with values 0 or 255 (black and white pixels).
The function detects the index ranges of runs of black pixels on a single line:
[0, 0, 0, 255, 255, 0, 255, 255, 255] -> ((0, 2), (5, 5))
Code for testing (objects_on_line is what needs to be parallelized):
import numpy as np
from time import perf_counter
import cv2

OBJ_COLOR = 0  # pixel value that counts as an "object" (black)

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield tuple(lst[i:i + n].tolist())

def objects_on_line(line):
    line_original = np.array(line)
    # Indexes of all object (black) pixels on the line
    line = np.where(line_original == OBJ_COLOR)[0]
    # Mask out indexes that lie strictly inside a run of consecutive object
    # pixels, keeping only the run boundaries
    mask = [0] * len(line)
    for i, el in enumerate(line[:-1]):
        if line[i + 1] - el == 1 and (i + 2 < len(line) and line[i + 2] - el == 2):
            mask[i + 1] = 1
    line = np.ma.array(line, mask=mask).compressed()
    # Where non-object pixels sit between two kept indexes, duplicate the first
    # index so it closes its own (start, end) pair
    i = 0
    for _ in line[:-1]:
        if i + 1 >= len(line):
            break
        lv = line[i]
        ls = line_original[lv:line[i + 1]]
        if len(np.where(ls != OBJ_COLOR)[0]) != 0:
            line = np.insert(line, i, lv)
        i += 2
    # Pad with the last index so the indexes group cleanly into pairs
    if len(line) % 2 != 0:
        line = np.insert(line, len(line), line[-1])
    line = list(chunks(line, 2))
    return line

img = cv2.cvtColor(cv2.imread("test_people.png"), cv2.COLOR_RGB2GRAY)
# cv2.threshold returns (retval, image); keep only the thresholded image
_, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

s = perf_counter()
for line in img_bin:
    objects_on_line(line)
print(f"Done in: {perf_counter() - s}")
CodePudding user response:
Your only option is to rewrite the function to accept more elements at a time and rework the algorithm to reduce wasted work.
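For example, here is a rough sketch (the function name objects_on_image and the exact approach are my own, not spelled out in this answer) of what "accept more elements" could look like: hand the whole binarized image to one call and find all runs of black pixels with vectorized transition detection, so the per-row Python call overhead disappears.

import numpy as np

OBJ_COLOR = 0

def objects_on_image(img_bin):
    """For every row, return a list of (start, end) index pairs of runs of OBJ_COLOR pixels."""
    obj = (img_bin == OBJ_COLOR)
    # Pad each row with zeros on both sides so every run produces one start and one end
    padded = np.zeros((obj.shape[0], obj.shape[1] + 2), dtype=np.int8)
    padded[:, 1:-1] = obj
    diff = np.diff(padded, axis=1)
    rows, starts = np.nonzero(diff == 1)   # 0 -> 1 transition: a run starts here
    _, ends = np.nonzero(diff == -1)       # 1 -> 0 transition: the run ended one pixel earlier
    ends = ends - 1
    # np.nonzero returns indexes in row-major order, so the k-th start and
    # k-th end within a row belong to the same run
    out = [[] for _ in range(obj.shape[0])]
    for r, s, e in zip(rows.tolist(), starts.tolist(), ends.tolist()):
        out[r].append((s, e))
    return out

On the example line [0, 0, 0, 255, 255, 0, 255, 255, 255] this yields [(0, 2), (5, 5)] for that row, and a single vectorized pass over a couple of thousand rows is usually far cheaper than a couple of thousand separate calls, with or without a process pool.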
CodePudding user response:
line_original = np.array(line)
Don't do that. Change the API you define, so there is an objects_on_lines(image: np.array, lines: tuple[int]) function. You want to put all the image pixels into a single numpy array and then leave them there. Pass in a (start, end) range of indexes for a core to chew on. (Heck, you could even pass in a range object.)
Avoid copying into a newly created list. Just use the pixels where they are; point at them with indexes.
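A minimal sketch of that API shape (the names and the use of multiprocessing.shared_memory are my assumptions, not something this answer spells out): the image lives in one shared buffer, and the only thing sent to a worker is the buffer's name and a (start, end) pair of row indexes.

import numpy as np
from multiprocessing import shared_memory

OBJ_COLOR = 0

def objects_on_lines(shm_name, shape, dtype, row_range):
    # Attach to the image that already lives in shared memory: no pixel data
    # crosses the process boundary, only a name and two integers.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        image = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        start, end = row_range
        out = []
        for row in image[start:end]:
            # Read the pixels where they are and record runs of object pixels
            padded = np.concatenate(([False], row == OBJ_COLOR, [False]))
            diff = np.diff(padded.astype(np.int8))
            starts = np.flatnonzero(diff == 1)
            ends = np.flatnonzero(diff == -1) - 1
            out.append(list(zip(starts.tolist(), ends.tolist())))
        return out
    finally:
        shm.close()

The parent process would create the SharedMemory block once, copy img_bin into it, and submit (start, end) chunks to a ProcessPoolExecutor, so each task's payload is a few bytes instead of a pickled list of rows.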
It's not clear that your system will be able to achieve a 12x speedup, even if coded in Rust or C++. It might be the case that a handful of cores can max out your main memory bandwidth; increasing the multiprogramming level beyond that, to 12 or 24, would be counterproductive.
Take care to represent pixels compactly. At least use uint8. If possible, use a bit vector, for an 8x reduction in memory bandwidth requirements.
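As an illustration of the bit-vector idea (a sketch, not part of the original answer): NumPy can pack a binarized image into one bit per pixel with np.packbits, so eight pixels move per byte of memory traffic.

import numpy as np

# Pack a 0/255 binary image into 1 bit per pixel along each row.
img_bin = np.random.choice([0, 255], size=(1080, 1920)).astype(np.uint8)
bits = np.packbits(img_bin == 255, axis=1)                  # shape (1080, 240), uint8
rows = np.unpackbits(bits, axis=1, count=img_bin.shape[1])  # back to 0/1 per pixel
assert rows.shape == img_bin.shape

The unpacking step costs CPU time, so this only pays off when the workload is genuinely memory-bound.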
Consider staggering your offsets within the array. If all threads are hitting virtual memory addresses which conflict for the same cache line, then with 4-way associativity there's a fairly low limit on how many threads can productively contend for fresh pixels.
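One way to read the staggering advice (my interpretation; the 64-byte cache-line size and the amount of padding are assumptions): give each row a pitch that is not a large power of two, so rows read concurrently by different threads do not all map into the same handful of cache sets.

import numpy as np

H, W = 1024, 2048   # a power-of-two row width is the worst case for set aliasing
PAD = 64            # one assumed cache line of extra bytes per row

# Allocate a buffer with a padded row pitch and view the image inside it;
# consecutive rows now start at staggered cache-set offsets.
buf = np.zeros((H, W + PAD), dtype=np.uint8)
img = buf[:, :W]
print(img.strides)  # (2112, 1): a 2112-byte pitch instead of 2048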