Say I have a num_indices * n indices matrix (with entries in range(m)) and a num_indices * n values matrix, i.e.,
import time
import numpy as np

m, n = 100, 50
num_indices = 100000
indices = np.random.randint(0, m, (num_indices, n))
values = np.random.uniform(0, 1, (num_indices, n))
I want to get a result matrix of shape m * n, such that each column of the result is produced by indexing into and summing over the corresponding columns of the indices and values matrices. Here are some of the implementations:
- The basic for loop
tic = time.time()
res1 = np.zeros((m, n))
for i in range(num_indices):
    for j in range(n):
        res1[indices[i, j], j] += values[i, j]
print(f'element-wise for loop used {time.time() - tic:.5f}s')
- np.add.at, multidimensional
tic = time.time()
res2 = np.zeros((m, n))
np.add.at(res2, (indices, np.arange(n)[None, :]), values)
print(f'np.add.at used {time.time() - tic:.5f}s')
- np.add.at, with for loop
tic = time.time()
reslst = []
for colid in range(n):
    rescol = np.zeros((m, 1))
    np.add.at(rescol, (indices[:, colid], np.zeros(num_indices, dtype=int)), values[:, colid])
    reslst.append(rescol)
res3 = np.hstack(reslst)
print(f'np.add.at with loop used {time.time() - tic:.5f}s')
- scipy.sparse, multidimensional
from scipy import sparse
tic = time.time()
res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
print(f'sparse.csr used {time.time() - tic:.5f}s')
- scipy.sparse, with for loop
tic = time.time()
reslst = []
for colid in range(n):
    rescol = sparse.csr_matrix((values[:, colid], (indices[:, colid], np.zeros(num_indices, dtype=int))), shape=(m, 1)).toarray()
    reslst.append(rescol)
res5 = np.hstack(reslst)
print(f'sparse.csr with loop used {time.time() - tic:.5f}s')
The results are all the same:
assert np.isclose(res2, res1).all()
assert np.isclose(res3, res1).all()
assert np.isclose(res4, res1).all()
assert np.isclose(res5, res1).all()
However, the timings are strange and unsatisfying:
for loop used 1.69889s
np.add.at used 0.17287s
np.add.at with loop used 0.47767s
sparse.csr used 0.16847s
sparse.csr with loop used 0.09845s
My basic questions are:
- Why is np.add.at so slow - even slower than scipy.sparse? I thought numpy would benefit a lot, especially in the multidimensional case.
- For scipy.sparse, why is the multidimensional version even slower than the for loop? Isn't concurrency used? (And why is the multidimensional version faster for numpy?)
- Overall, is there any faster way to implement what I want?
CodePudding user response:
Surprisingly, it turns out that np.add.at is very inefficiently implemented in Numpy.
Regarding scipy.sparse, I cannot reproduce the same performance improvement on my machine with Scipy 1.7.1: it is barely faster. Like Numpy's np.add.at, it is far from being fast.
You can implement this operation efficiently using Numba:
import numba as nb

@nb.njit('(int_[:,::1], float64[:,::1], int_, int_)')
def compute(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    for i in range(indices.shape[0]):
        for j in range(n):
            #assert 0 <= indices[i, j] < m
            res[indices[i, j], j] += values[i, j]
    return res
tic = time.time()
result = compute(indices, values, n, m)
print(f'Numba used {time.time() - tic:.5f}s')
Note that the function assumes that indices contains valid values (i.e. no out-of-bounds indices). Otherwise, the function can crash or silently corrupt the program. You can enable the assertion if you are not sure about that, at the expense of a slower computation (about twice as slow on my machine).
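An alternative to the manual assertion (my addition; it assumes a Numba version recent enough to support the boundscheck flag, available since 0.47) is to let Numba insert the bounds checks itself:
# Sketch: boundscheck=True makes out-of-bounds indices raise IndexError
# instead of crashing or corrupting memory, at some runtime cost.
@nb.njit(boundscheck=True)
def compute_checked(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    for i in range(indices.shape[0]):
        for j in range(n):
            res[indices[i, j], j] += values[i, j]
    return res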
Note that this implementation is fast as long as 8 * n * m is small enough for the output array to fit in the L1 cache (generally from 16 KB to 64 KB); here 8 * 100 * 50 = 40 KB. Otherwise, it can be a bit slower due to the inefficient random access pattern. If the output array does not fit in the L2 cache (generally a few hundred KB), then this will be significantly slower.
Results
element-wise for loop used 2.45158s
np.add.at used 0.28600s
sparse.csr used 0.19600s
sparse.csr with loop used 0.18900s
Numba used 0.00500s
Thus, Numba is about 40 times faster than the fastest of the other implementations and ~500 times faster than the slowest one. The Numba code is much faster since it is compiled to an optimized native binary with no additional overhead (i.e. bounds checking, temporary arrays, function calls), as opposed to Numpy and Scipy.
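If the problem grows, a parallel variant is also possible. This is a sketch of mine, not part of the original answer: parallelizing over columns with nb.prange is race-free here, because each thread writes to its own output column.
# Hedged sketch: column j of res is written by exactly one thread,
# so the scatter-adds never conflict.
@nb.njit(parallel=True)
def compute_parallel(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    for j in nb.prange(n):
        for i in range(indices.shape[0]):
            res[indices[i, j], j] += values[i, j]
    return res
Whether this pays off depends on n and on memory bandwidth; for a 100 * 50 output, the serial version may already be dominated by loop startup costs.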
CodePudding user response:
Replicating this to better understand what's happening:
In [48]: m, n = 100, 50
...: num_indices = 100000
...: indices = np.random.randint(0, m, (num_indices, n))
...: values = np.random.uniform(0, 1, (num_indices, n))
The add.at case
In [49]: res2 = np.zeros((m, n))
...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
In [50]: res2
Out[50]:
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
497.3412013 , 508.51132267],
...])
In [51]: %%timeit
...: res2 = np.zeros((m, n))
...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
613 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The sparse approach
In [52]: from scipy import sparse
In [53]: res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.
...: arange(n), num_indices))), shape=(m, n)).toarray()
In [54]: res4
Out[54]:
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
497.3412013 , 508.51132267],
[....)
Values are the same and there's nothing pathological about them (i.e. not all 0s, just the sums of a lot of floats).
And indeed, the sparse times are a bit better:
In [55]: timeit res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.t
...: ile(np.arange(n), num_indices))), shape=(m, n)).toarray()
402 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Usually I find that creating a sparse matrix takes longer than making a dense one, especially when the size is only (100,50). And in a sense, it does take long here, nearly half a second. But that's short compared to the add.at time.
Replacing the values with 1's, I confirm that on average each array element has 1000 duplicates - as expected, since 100000 rows are scattered into 100 bins (I could also have deduced that from the res2 values, which cluster around 1000 times the 0.5 mean of the uniform (0,1) values).
In [57]: res5 = sparse.csr_matrix((np.ones_like(indices.ravel()), (indices.ravel
...: (), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
In [58]: res5
Out[58]:
array([[ 971, 1038, 1004, ..., 1056, 988, 1030],
[1016, 992, 949, ..., 994, 982, 1011],
[ 984, 1014, 980, ..., 1001, 1053, 1057],
...,])
If I recall past sparse code explorations correctly, with these inputs csr_matrix first makes a coo matrix, then sorts the indices lexically - that is, sorts on rows, and within those on columns. This puts duplicate indices together, which it can easily sum. Given the large number of duplicates, it shouldn't be surprising that this is relatively efficient.
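A tiny illustration of that duplicate-summing behavior (my addition; scipy documents that duplicate (row, col) entries are summed when a matrix is built from triplets):
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 3.0])
# The two entries at (0, 0) are summed: the dense result is [[3., 0.], [0., 3.]]
print(sparse.csr_matrix((vals, (rows, cols)), shape=(2, 2)).toarray())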
The randomness is just in the row indices. The sparse-with-loop version takes advantage of this by iterating over the n (here 50) columns. Creating each (100,1) sparse matrix is then easier: again, just sorting on the row indices and doing the sum.
In [61]: %%timeit
...: reslst = []
...: for colid in range(n):
...: rescol = sparse.csr_matrix((values[:, colid], (indices[:, colid], n
...: p.zeros(num_indices, dtype=int))), shape=(m, 1)).toarray()
...: reslst.append(rescol)
...: res5 = np.hstack(reslst)
258 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I might add that np.add.at, while faster than the full python loop, usually is slower than the buffered += that it corrects.
In [72]: %%timeit
...: res10 = np.zeros((m,n))
...: res10[indices, np.arange(n)] += values
103 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
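To spell out what "corrects" means (my addition): with duplicate indices, the buffered += applies each duplicated index only once, so its result is wrong; np.add.at accumulates every occurrence.
a = np.zeros(3)
idx = np.array([0, 0, 1])
a[idx] += 1.0           # buffered: index 0 appears twice but is counted once
print(a)                # [1. 1. 0.]
b = np.zeros(3)
np.add.at(b, idx, 1.0)  # unbuffered: both occurrences of index 0 accumulate
print(b)                # [2. 1. 0.]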
To put these times in perspective, the time to generate the values array is:
In [75]: timeit values = np.random.uniform(0, 1, (num_indices, n))
97.5 ms ± 90.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
about 1/6 of the add.at time. And the time to create a sparse matrix from that:
In [76]: timeit sparse.csr_matrix(np.random.uniform(0, 1, (num_indices, n)))
379 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So times on the order of 300 ms to perform the indexed addition are not out of the ordinary. Handling (num_indices, n) shaped arrays, in any way, will take time.
The res5 case, counting duplicates, can be done with bincount:
In [88]: np.array([np.bincount(indices[:,i]) for i in range(n)]).T
Out[88]:
array([[ 971, 1038, 1004, ..., 1056, 988, 1030],
[1016, 992, 949, ..., 994, 982, 1011],
[ 984, 1014, 980, ..., 1001, 1053, 1057],
...,])
In [89]: timeit np.array([np.bincount(indices[:,i]) for i in range(n)]).T
64.7 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
edit
Interestingly, if I make a coo_matrix instead, times are quite a bit faster:
In [95]: timeit res4 = sparse.coo_matrix((values.ravel(), (indices.ravel(), np.t
...: ile(np.arange(n), num_indices))), shape=(m, n)).toarray()
78.3 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It's still doing the sum-of-duplicates step, but as part of the toarray() step. It just doesn't have to do the extra work of creating the csr matrix.
edit
I just remembered that previous add.at SO answers have used bincount with weights:
In [105]: res0 = np.array([np.bincount(indices[:,i],weights=values[:,i]) for i i
...: n range(n)]).T
In [106]: np.allclose(res2,res0)
Out[106]: True
In [107]: timeit res0 = np.array([np.bincount(indices[:,i],weights=values[:,i])
...: for i in range(n)]).T
142 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
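The per-column loop can be eliminated entirely with the same bincount idea (my addition, a sketch rather than a tested timing): map each (row, column) pair to one flat index and call bincount once.
# Cell (r, c) of the m*n result maps to flat index r*n + c.
flat = indices * n + np.arange(n)   # broadcasts column offsets over the rows
res0b = np.bincount(flat.ravel(), weights=values.ravel(), minlength=m * n).reshape(m, n)
assert np.allclose(res0b, res2)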