In my code I have an array of point positions and a sparse incidence matrix. For every non-zero ij-element of the matrix I want to calculate the distance between the i-point and the j-point.
Currently I'm extracting the indices with nonzero(); however, this method sorts the output indices, and this sorting takes most of the execution time of my entire application.
import numpy as np
import scipy as sc
import scipy.sparse

# number of points
n = 100000
# point positions
pos = np.random.rand(n, 3)
# building the incidence matrix, defining the pairwise calculation
density = 1e-3
n_entries = int(n**2 * density)
indx = np.random.randint(0, n, n_entries)
indy = np.random.randint(0, n, n_entries)
Pairs = scipy.sparse.csc_matrix((np.ones(n_entries), (indx, indy)), (n, n))

# current method
ii, jj = Pairs.nonzero()  # this part is slow
d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))
CodePudding user response:
Modified answer
The main problem is that there are sometimes duplicates in indx, indy. For this reason, we cannot simply do:
d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
...because then the values at duplicated indices would be multiplied by the number of duplicates.
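As a tiny illustration (a hypothetical two-point example, not part of the original code): the pair (0, 1) is listed twice, so the stored value becomes twice the true distance.
p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ix = np.array([0, 0])  # the pair (0, 1) appears twice
iy = np.array([1, 1])
d = np.linalg.norm(p[ix] - p[iy], axis=1)           # [1., 1.]
m = scipy.sparse.csr_matrix((d, (ix, iy)), (2, 2))  # duplicates are summed
print(m[0, 1])                                      # 2.0, not the true distance 1.0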
The original answer below deduplicated the indices first. But even with the faster pd_unique2_idx, the whole operation ends up taking more time than the original.
Instead, since you already calculated Pairs (which, if you noticed, has the same duplicate-index problem: Pairs.max() often gives more than 1.0), let us try to get the indices of the non-zero values without any sorting, just using what is already in the structure. Here is a way to do it, following the definition of indptr and indices in the docs:
ii = Pairs.indices  # row index of each stored entry, straight from the csc structure
jj = np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))  # column index of each stored entry
What you get is the same indices as your own ii, jj, but ordered differently (column-major instead of row-major).
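If you want to convince yourself of that, here is a quick sanity check (a sketch; it assumes nonzero() returns row-major order, which is what recent scipy versions do and why it sorts):
order = np.lexsort((jj, ii))  # reorder the csc-derived indices row-major
i0, j0 = Pairs.nonzero()      # slow reference result
assert np.array_equal(ii[order], i0) and np.array_equal(jj[order], j0)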
Timing
Just looking at the part that gets ii, jj:
%timeit Pairs.nonzero()
# 1.27 s ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
# 41.3 ms ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In context:
def orig(pos, Pairs, n):
    # current method
    ii, jj = Pairs.nonzero()  # this part is slow
    d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
    distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))
    return distance_matrix
def modified(pos, Pairs, n):
    # only change: these two lines
    ii = Pairs.indices
    jj = np.repeat(np.arange(len(Pairs.indptr) - 1), np.diff(Pairs.indptr))
    # unchanged
    d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
    distance_matrix = scipy.sparse.csr_matrix((d, (ii, jj)), (n, n))
    return distance_matrix
On your data (with np.random.seed(0) at the start, for repeatable results):
%timeit orig(pos, Pairs, n)
# 2.16 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit modified(pos, Pairs, n)
# 1.11 s ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Correctness
dm = orig(pos, Pairs, n)
dmod = modified(pos, Pairs, n)
np.all(np.c_[dm.nonzero()] == np.c_[dmod.nonzero()])
# True
i, j = dm.nonzero()
np.all(dm[i, j] == dmod[i, j])
# True
Original answer
You could calculate another list of distances in indx, indy order, then make a sparse matrix of that:
d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
However, comparison with your own distance matrix shows some rare but possible differences: they come from cases where you have repeated locations in indx, indy, and the corresponding Pairs.toarray()[indx, indy] is the number of repeats. This stems from a well-known "feature" of csc and csr sparse matrices: duplicate elements are summed (e.g. here or here).
For example:
>>> indx = np.array([0,1,0])
>>> indy = np.array([2,0,2])
>>> scipy.sparse.csc_matrix((np.ones(3), (indx, indy))).toarray()
array([[0., 0., 2.],
       [1., 0., 0.]])
Unfortunately, the solution proposed by @hpaulj in the posts linked above, using todok(), does not work anymore with recent versions of scipy.
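One alternative that still works (a sketch of my own, not from the linked posts): since the csc construction has already summed the duplicates into single stored entries, resetting the stored data restores a binary incidence matrix.
Pairs.data[:] = 1.0  # every stored entry becomes 1 again, whatever its duplicate count was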
Using np.unique() is slow, as it sorts the values. pd.unique() is faster (no sort), but accepts only 1D arrays. Here is a way to do a fast 2D unique. Note that we need the indices of the unique rows, so that we can index indx and indy, and also the distance matrix if that was already calculated.
import pandas as pd

def pd_unique2_idx(indx, indy):
    # keep the first occurrence of each (indx, indy) pair; no sorting involved
    df = pd.DataFrame(np.c_[indx, indy])
    return np.nonzero(~df.duplicated().values)
# by contrast (slower on long sequences, because it sorts)
def np_unique2_idx(indx, indy):
    rc = np.vstack([indx, indy]).T.copy()
    dt = rc.dtype.descr * 2  # view each (indx, indy) row as one structured element
    i = np.unique(rc.view(dt), return_index=True)[1]
    return i
On 1M elements for both indx and indy, I see:
pd_unique2_idx: 52.1 ms ± 529 µs per loop
np_unique2_idx: 972 ms ± 19.5 ms per loop
How to use this? Following your setup:
i = pd_unique2_idx(indx, indy)
indx = indx[i]
indy = indy[i]
d2 = np.linalg.norm(pos[indx] - pos[indy], axis=1)
distance_matrix2 = scipy.sparse.csr_matrix((d2, (indx, indy)), (n, n))
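As an optional sanity check (a sketch; distance_matrix here is the one from the question, built with the slow nonzero() method), the deduplicated matrix should agree with it entry by entry:
i2, j2 = distance_matrix2.nonzero()
assert np.allclose(np.asarray(distance_matrix[i2, j2]),
                   np.asarray(distance_matrix2[i2, j2]))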
CodePudding user response:
With your sample arrays:
In [104]: Pairs = sparse.csc_matrix((np.ones(n_entries) ,(indx,indy)) , (n,n))
the time to create the csc itself is not trivial:
In [105]: timeit Pairs = sparse.csc_matrix((np.ones(n_entries), (indx, indy)), (n, n))
1.08 s ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nonzero converts the matrix to coo, and then does an extra check for 0s:
def nonzero(self):
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])
The coo conversion is relatively fast. I don't think there's any sorting involved.
In [106]: timeit Pairs.tocoo()
221 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That extra masking does add time.
In [107]: timeit Pairs.nonzero()
1.86 s ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Since a freshly made csc will not have any explicit 0s, you could do something like:
M = Pairs.tocoo()
ii, jj = M.row, M.col
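Put into the same shape as the functions in the other answer (a sketch under the same assumption, i.e. no explicit zeros in Pairs):
def coo_method(pos, Pairs, n):
    M = Pairs.tocoo()      # cheap; reuses the csc structure
    ii, jj = M.row, M.col  # no masking, no sorting
    d = np.linalg.norm(pos[ii] - pos[jj], axis=1)
    return sparse.csr_matrix((d, (ii, jj)), (n, n))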
Creating the coo matrix directly is faster:
In [108]: timeit sparse.coo_matrix((np.ones(n_entries) ,(indx,indy)) , (n,n))
158 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This sets the row and col attributes to indx and indy directly (or at most with a dtype change). It's the csc creation that sorts the indices (lexically by column and then by row) and does any duplicate summation.