This (modulo 2) binary matrix multiplication algorithm seems to underperform. What can I do better?-CodePudding

The question has changed since its initial posting as I've chased down a few leads. At this point, I'd say that I'm really looking for the following answers:

Can a significant amount of time be saved by replacing addition/multiplication followed by a modulo 2 operation with and/logical_xor (assuming that the total number of such operations is kept the same)? If not, then why not? ANSWER: some time can indeed be saved, but it's arguable whether that amount is "significant".
Where can I read more about the specific approach taken by the BLAS matrix multiplication underlying numpy? Ideally, I'd like a source that doesn't require deciphering the FORTRAN code forged by the sages of the past. ANSWER: The original paper proposing the BLAS matrix multiplication algorithms used today

Minor updates:

I was able to test these out for larger matrices (up to 1000x1000) and get a better sense of the asymptotics here. It indeed seems to be the case that the "default" algorithm here is O(n^2.7), whereas the alternative is the expected O(n³) (the observed slopes were 2.703 and 3.133, actually).

I also checked how the alternative algorithm compared to the following implementation of "schoolbook" matrix multiplication followed by a mod operation.

def mat_mult_3(A,B): return np.sum(A[:,:,None]*B[None,:,:],axis = 1)%2

I was very surprised to find that this also does better than the and/xor based method!

In response to Michael's comment, I replaced mat_mult_2 with the following:

def mat_mult_2(A,B): return np.logical_xor.reduce(A.astype(bool)[:,:,None] & B.astype(bool)[None,:,:],axis = 1).astype(int)

This arguably still places an undue burden of type conversion on the method, but sticking to multiplication between boolean matrices didn't significantly change performance. The result is that mat_mult_2 now (marginally) outperforms mat_mult_3, as expected.

CodePudding user response：

For a modest n=10 lets compare some alternatives:

Using @ and modulus:

In [15]: timeit A@A%2 8.1 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Your alternative:

In [16]: timeit np.logical_xor.reduce(A[:,:,None]&A[None,:,:],axis = 1) 25 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The @ equivalent:

In [17]: timeit np.sum(A[:,:,None]&A[None,:,:], axis=1)%2 33.2 µs ± 65.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So the logical operations are somewhat faster, but not drastically so.

And to get an idea of how much time the modulus step takes - about 4us.

In [18]: timeit np.sum(A[:,:,None]&A[None,:,:], axis=1) 29.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) In [19]: timeit A@A 4.52 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So in [15], the @ and modulus take about the same time.

edit

In [27]: timeit np.sum(A[:,:,None]*A[None,:,:], axis=1) 28.9 µs ± 81.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

CodePudding user response：

It looks like I mostly answered my own question. Here's a summary of what I found.

One way the method I proposed falls short of the numpy method is in its asymptotic complexity. Whereas my method follows the naive AKA "schoolbook" algorithm of matrix multiplication, numpy pulls its approach from the BLAS routines. My best guess is that numpy is using SGEMM method, which to my limited understanding based on some quick googling and article skimming seems to be a variant of the Strassen algorithm for matrix multiplication. So, where my method does O(n³) operations (for a product of two binary nxn matrices), numpy's method does O(n^2.8) (which is roughly borne out by my observations).

Another way my method falls short is the repeated implicit type conversions that occur when calling boolean methods on an array of integers. This can be avoided by using boolean arrays as the algorithm input.

The result, accounting for these discrepancies, is this: if the schoolbook algorithm is applied but addition and multiplication are replaced by XOR and AND, then (according to my trials) the computation time is reduced by about 20%. This isn't nothing, but less than I was expecting.

edit