I have a 2D dataset with values.shape = (N, N)
and 3 criteria. I want to make an image for those criteria with a color code, so I end up with an NxNx4 boolean array mask
which holds, for every pixel, the flags (background, color 1, color 2, color 3).
- If none of the criteria is fulfilled for a pixel, I set it to [T, F, F, F], i.e. background.
- If any of the criteria are fulfilled (e.g. [F, T, F, T]), then the background is automatically False and I want to randomly assign one of the corresponding colors, here color1 or color3. If just one criterion is True, that's the one that gets picked.
In other words: If a pixel fulfills one criterion, it gets the corresponding color. If it matches several criteria, one of them is randomly picked.
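For reference, here is a minimal sketch of how such a mask and color table could look (the criteria c1/c2/c3 and the thresholds are made up, just to make the setup concrete):

import numpy as np

N = 512
values = np.random.rand(N, N)

# Three made-up criteria on the 2D data
c1 = values > 0.8
c2 = values < 0.1
c3 = (values > 0.4) & (values < 0.5)
background = ~(c1 | c2 | c3)

# NxNx4 boolean mask: (background, color 1, color 2, color 3)
mask = np.stack([background, c1, c2, c3], axis=-1)

# One RGBA color per mask channel
colors = np.array([[  0,   0,   0, 255],   # background
                   [255,   0,   0, 255],   # color 1
                   [  0, 255,   0, 255],   # color 2
                   [  0,   0, 255, 255]],  # color 3
                  dtype=np.uint8)
img = np.empty((N, N, 4), dtype=np.uint8)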
Right now, I do
# assign a probability
p = mask / mask.sum(-1)[:, :, None]
# the indices of the colors
lst = np.arange(len(colors))
for i in range(mask.shape[0]):
    for j in range(mask.shape[1]):
        # randomly pick a color index where the mask is true
        idx = np.random.choice(lst, p=p[i, j])
        # assign that color to the pixel
        img[i, j, :] = colors[idx, :]
I can do everything up to that point fast with numpy expressions, but I cannot do this assignment without a double loop. Is there any way to avoid the double loop with some numpy magic?
CodePudding user response:
AFAIK, there is no way to efficiently vectorize np.random.choice in this case using only Numpy (though there might be a faster way than 2 nested loops). The main issue with the provided implementation is that np.random.choice takes several dozen microseconds per call, so the overall computation takes at least several minutes. This overhead comes from:
- checks: for example, Numpy checks that the probabilities in p sum to 1 (in O(len(p)));
- a slow Kahan summation in the Numpy implementation (which might be optimized).
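For reference, the per-call overhead can be measured with a quick (machine-dependent) micro-benchmark along these lines:

import timeit
import numpy as np

p = np.array([0.0, 0.5, 0.0, 0.5])   # probabilities of one pixel
lst = np.arange(p.size)
n = 10_000
t = timeit.timeit(lambda: np.random.choice(lst, p=p), number=n)
print(f'{t / n * 1e6:.1f} µs per np.random.choice call')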
A simple solution is to use Numba to get rid of most of this overhead, but Numba does not support the p parameter of np.random.choice. It is not easy to make the computation efficient with large p arrays. One solution is to compute the cumulative sum of p and pick a random value (between 0 and 1) so as to select an item of lst with the probability given by p, using a binary search. In fact, it is not even needed to select an item of lst since lst[i] == i. Here is a Numba-based implementation:
import numpy as np
import numba as nb

# Faster equivalent of `np.random.choice(np.arange(len(p)), p=p)`
@nb.njit('(float64[::1],)')
def randomInt(p):
    assert p.size > 0
    if p.size == 1:
        return 0
    tmp = np.empty(p.size, np.float64)
    # Cumulative sum
    s = 0.0
    for i in range(p.size):
        s += p[i]
        tmp[i] = s
    # Renormalization (to increase the accuracy)
    rnd = np.random.rand() * s
    idx = np.searchsorted(tmp, rnd)
    return idx
This function is about 22 times faster than the Numpy one. However, it is slightly less accurate if the standard deviation of the probabilities is huge (e.g. most items set to 0.00001 and a few to 0.2) or if the number of items is huge (e.g. >10_000). That being said, the re-normalization should be enough to give very good results for thousands of items (and non-extreme standard deviations). Note that p is automatically re-normalized.
You can use this function in pure-Python code, but it is faster to also use Numba to speed up the two nested loops (by putting them in a Numba function).
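For example, a sketch of such a driver (using the p, colors and img arrays from the question; the function name fillImage is mine) could look like this:

@nb.njit
def fillImage(p, colors, img):
    # p must be a C-contiguous float64 array so that p[i, j]
    # matches the signature of randomInt
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            idx = randomInt(p[i, j])
            img[i, j, :] = colors[idx, :]

fillImage(p, colors, img)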
Further optimizations
On most processors, the speed of randomInt is limited by the dependency chain of the += operations on floating-point numbers. Indeed, a floating-point addition takes about 4 cycles on Skylake-related processors, 3 on previous Intel architectures and also 3 on AMD Zen-based processors. This means that for 1000 items, the cumulative sum needs at least 4000 cycles on Skylake (i.e. 1 µs on a 4 GHz processor). An optimization is to break the dependency chain by computing the even/odd parts of the cumulative sum with more independent instructions, so the processor can better overlap their latency.
Another optimization is to parallelize the nested loops in Numba using prange and the decorator argument parallel=True. That being said, the scalability can be impacted by contention in the allocator due to the many np.empty calls. This issue can be fixed by pre-allocating tmp in the outer loop and using prange only for this loop (a sketch of such a parallel driver is given at the end of this answer). The signature of randomInt needs to change so as to take the temporary tmp array as a parameter.
Here is an optimized Numba function (untested):
@nb.njit('(float64[::1], float64[::1])')
def randomIntOpt(p, tmp):
    assert p.size > 0
    assert tmp.size == p.size
    if p.size == 1:
        return 0
    s1, s2 = p[0], p[1]
    tmp[0], tmp[1] = s1, s1 + s2
    # Optimized cumulative sum
    # (unrolled so as to break the dependency chain)
    i1 = 2
    i2 = 3
    iMax = p.size
    while i2 < iMax:
        s1 += p[i1]
        tmp[i1] = s1 + s2
        s2 += p[i2]
        tmp[i2] = s1 + s2
        i1 += 2
        i2 += 2
    # Remainder
    if p.size % 2 == 1:
        s1 += p[i1]
        tmp[i1] = s1 + s2
    s = s1 + s2
    # Renormalization (to increase the accuracy)
    rnd = np.random.rand() * s
    idx = np.searchsorted(tmp, rnd)
    return idx
This function is about 32 times faster in sequential when called from CPython on my machine. When called from Numba, it is about 40 times faster. Using multiple threads should improve the speedup much further (>100 on mainstream PC processors). This should be enough for the computation to take only a few seconds. I expect the final computation to be clearly memory-bound due to the huge size of p (typically >200 GiB). Consider reducing its size for better performance (it does not seem reasonable to keep a >10 KiB array per pixel).
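For completeness, a possible parallel driver for randomIntOpt based on the above description (again a sketch, not benchmarked here) could look like this:

@nb.njit(parallel=True)
def fillImageParallel(p, colors, img):
    for i in nb.prange(p.shape[0]):
        # One temporary buffer per outer iteration instead of one per pixel,
        # to limit the contention of the allocator
        tmp = np.empty(p.shape[2], np.float64)
        for j in range(p.shape[1]):
            idx = randomIntOpt(p[i, j], tmp)
            img[i, j, :] = colors[idx, :]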