Can you help me understand why the concurrent solution differs from the serial solution? How do I fix this issue?
Serial: Use the numpy.dot() method on the 1D array a and the 2D array W. aa and aa2 are the results from doing np.dot(a, W) twice, and as expected aa == aa2 is True.
Concurrent: Split the serial dot product of a and W into two concurrent jobs and then combine the results once both jobs are done. The difference shows up in roughly the last few decimal places of the solution (on the order of 1e-14 here). The concurrent result should be identical to the serial solution (i.e. aa == cc should be all True), but it is not. If you rerun this code a few times, you will notice that the elements of aa == cc are sometimes all False and sometimes a mix of True and False. This happens with either concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor.
import numpy as np
import concurrent.futures as cf
from time import perf_counter
nrows = 2
ncols = nrows
size = nrows * ncols
ndepth = 4
# a is 1D, W is 2D
a = np.random.randint(0, 255, size)
print(f'a.shapes = {a.shape}')
W = np.random.standard_normal((size, ndepth))
print(f'W.shapes = {W.shape}')
astart = perf_counter()
aa = np.dot(a, W)
aend = perf_counter()
print(f'aa.shapes = {aa.shape}')
print(f'a = {a}')
print(f'W = {W}')
print(f'aa = {aa}')
aa2 = np.dot(a, W)
print(f'aa == aa2 : {aa== aa2}')
print(f'aa - aa2 = {aa- aa2}')
ncpus = 2
batchsize = int(size/ncpus)
print(f'ncpus = {ncpus}')
print(f'batchsize = {batchsize}')
a_split = a.reshape(ncpus, batchsize)
# with cf.ProcessPoolExecutor(max_workers=ncpus) as executor:
with cf.ThreadPoolExecutor(max_workers=ncpus) as executor:
    futures = []
    rstart = 0
    rend = batchsize
    cstart = perf_counter()
    for n, input in enumerate(a_split):
        print(f'rstart={rstart} rend={rend}')
        futures.append(executor.submit(np.dot, input, W[rstart:rend, :]))
        rstart += batchsize
        rend += batchsize
    futures = cf.wait(futures, timeout=None, return_when='ALL_COMPLETED')
    results = [future.result() for future in futures.done]
    cc = np.sum(results, dtype='float64', axis=0)
    cend = perf_counter()
print(f'cc = {cc}')
print(f'cc.shape = {cc.shape}')
print(f'aa == cc: {aa==cc}')
print(f'aa - cc: {aa-cc}')
print()
print(f'{size} default : {aend-astart}')
print(f'{size} concurrent: {cend-cstart}')
Result:
a.shapes = (4,)
W.shapes = (4, 4)
aa.shapes = (4,)
a = [230 199 184 221]
W = [[ 0.51084911 1.35322019 -0.48898356 0.85261218]
[-0.49142492 -0.17330179 1.26965748 1.41627874]
[ 1.06185547 -1.21226666 0.22470322 -2.11876822]
[-1.39038965 1.15753592 0.12600325 -0.00918285]]
aa = [-92.19296978 309.51196126 209.38773113 86.05750838]
aa == aa2 : [ True True True True]
aa - aa2 = [0. 0. 0. 0.]
ncpus = 2
batchsize = 2
rstart=0 rend=2
rstart=2 rend=4
cc = [-92.19296978 309.51196126 209.38773113 86.05750838]
cc.shape = (4,)
aa == cc: [False True True False]
aa - cc: [-4.26325641e-14 0.00000000e+00 0.00000000e+00 -1.42108547e-14]
4 default : 1.9865998183377087e-05
4 concurrent: 0.000531537996721454
[Finished in 0.181s]
Cause of issue?
I think that there is no issue with my algorithm for performing the concurrent dot products of a and W. The above issue goes away when I change
W = np.random.standard_normal((size, ndepth))
to
W = np.arange(size*ndepth).reshape(size, ndepth)
Meaning, the issue occurs when W.dtype is float64 but not when W.dtype is int64.
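As an illustrative check (this snippet is not in the script above; it reuses a, size, ndepth and batchsize from it and does the split serially, with no threads involved), the integer version agrees exactly:
W_int = np.arange(size*ndepth).reshape(size, ndepth)
full_int = np.dot(a, W_int)
split_int = np.dot(a[:batchsize], W_int[:batchsize, :]) + np.dot(a[batchsize:], W_int[batchsize:, :])
print(np.array_equal(full_int, split_int))   # True: integer dot products of these magnitudes are exact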
Answer:
I will refer you to the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic. In particular for your issue, floating-point arithmetic is neither associative nor distributive: a + (b + c) does not necessarily equal (a + b) + c, nor does a*(b + c) necessarily equal a*b + a*c.
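You can reproduce this without any concurrency at all. The following sketch is my illustration (reusing the shapes from the question, with the 2-row split done serially); the exact values vary from run to run, but the comparison typically shows the same last-bit differences:
import numpy as np

a = np.random.randint(0, 255, 4)
W = np.random.standard_normal((4, 4))

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: addition is not associative

full = np.dot(a, W)                                          # all four rows accumulated in one pass
split = np.dot(a[:2], W[:2, :]) + np.dot(a[2:], W[2:, :])    # same math, different summation order

print(full == split)    # often a mix of True and False
print(full - split)     # differences on the order of 1e-14, if any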
To make matters more complicated, some processors also have an FMA (fused multiply-add) instruction that numpy might use, which computes a*b + c in a single instruction with only one rounding, giving a potentially different (more accurate, but different) result than first computing a*b and then adding c.
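If you want to see the fused-versus-unfused difference directly, Python 3.13 adds math.fma; this sketch assumes Python 3.13 or newer and is only an illustration, not something the original code or numpy exposes:
import math

x, y, z = 1e16, 1.0000000000000002, -1e16

fused = math.fma(x, y, z)   # x*y + z with a single rounding at the end (Python 3.13+)
unfused = x * y + z         # x*y rounded to a double first, then z added

print(fused, unfused)       # e.g. roughly 2.220446049250313 vs 2.0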
Ultimately you must always treat floating-point arithmetic as an approximation, unless you have very carefully analyzed your arithmetic and input values and found that they do not lose any precision (which virtually never happens for general arithmetic).
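In practice, the fix for the question's aa == cc check is to compare with a tolerance rather than with exact equality; one common approach (my suggestion, not part of the original answer) is numpy's allclose/isclose:
print(np.allclose(aa, cc))   # single True/False verdict using relative/absolute tolerances
print(np.isclose(aa, cc))    # per-element booleans with the same tolerances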