Can you help me understand why the concurrent solution differs from the serial solution? How do I fix this issue?
Serial: Use the numpy.dot() method on the 1D array a and the 2D array W. aa and aa2 are the results from doing np.dot(a, W) twice, and as expected aa == aa2 is True.
Concurrent: Split the serial dot product of a and W into two concurrent jobs and then combine the results once both jobs are done. The difference shows up in roughly the last few decimal places of the solution (on the order of 1e-14 here). The concurrent result should be identical to the serial solution (i.e. aa == cc should be all True), but it is not. If you rerun this code a few times, you will notice that the elements of aa == cc are sometimes all False and sometimes a mix of True and False. This happens with either concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor.
import numpy as np
import concurrent.futures as cf
from time import perf_counter
nrows = 2
ncols = nrows
size = nrows * ncols
ndepth = 4
# a is 1D, W is 2D
a = np.random.randint(0, 255, size)
print(f'a.shapes = {a.shape}')
W = np.random.standard_normal((size, ndepth))
print(f'W.shapes = {W.shape}')
astart = perf_counter()
aa = np.dot(a, W)
aend = perf_counter()
print(f'aa.shapes = {aa.shape}')
print(f'a = {a}')
print(f'W = {W}')
print(f'aa = {aa}')
aa2 = np.dot(a, W)
print(f'aa == aa2 : {aa== aa2}')
print(f'aa - aa2 = {aa- aa2}')
ncpus = 2
batchsize = int(size/ncpus)
print(f'ncpus = {ncpus}')
print(f'batchsize = {batchsize}')
a_split = a.reshape(ncpus, batchsize)
# with cf.ProcessPoolExecutor(max_workers=ncpus) as executor:
with cf.ThreadPoolExecutor(max_workers=ncpus) as executor:
    futures = []
    rstart = 0
    rend = batchsize
    cstart = perf_counter()
    for n, input in enumerate(a_split):
        print(f'rstart={rstart} rend={rend}')
        futures.append(executor.submit(np.dot, input, W[rstart:rend, :]))
        rstart += batchsize
        rend += batchsize
    futures = cf.wait(futures, timeout=None, return_when='ALL_COMPLETED')
    results = [future.result() for future in futures.done]
    cc = np.sum(results, dtype='float64', axis=0)
    cend = perf_counter()
print(f'cc = {cc}')
print(f'cc.shape = {cc.shape}')
print(f'aa == cc: {aa==cc}')
print(f'aa - cc: {aa-cc}')
print()
print(f'{size} default : {aend-astart}')
print(f'{size} concurrent: {cend-cstart}')
Result:
a.shapes = (4,)
W.shapes = (4, 4)
aa.shapes = (4,)
a = [230 199 184 221]
W = [[ 0.51084911 1.35322019 -0.48898356 0.85261218]
[-0.49142492 -0.17330179 1.26965748 1.41627874]
[ 1.06185547 -1.21226666 0.22470322 -2.11876822]
[-1.39038965 1.15753592 0.12600325 -0.00918285]]
aa = [-92.19296978 309.51196126 209.38773113 86.05750838]
aa == aa2 : [ True True True True]
aa - aa2 = [0. 0. 0. 0.]
ncpus = 2
batchsize = 2
rstart=0 rend=2
rstart=2 rend=4
cc = [-92.19296978 309.51196126 209.38773113 86.05750838]
cc.shape = (4,)
aa == cc: [False True True False]
aa - cc: [-4.26325641e-14 0.00000000e+00 0.00000000e+00 -1.42108547e-14]
4 default : 1.9865998183377087e-05
4 concurrent: 0.000531537996721454
[Finished in 0.181s]
Cause of issue?
I think that there is no issue with my algorithm for performing the concurrent dot products of a and W. The above issue goes away when I change
W = np.random.standard_normal((size, ndepth))
to
W = np.arange(size*ndepth).reshape(size, ndepth)
Meaning, the issue occurs when W.dtype is float64 but not when W.dtype is int64.
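As an illustrative check (this snippet is not in the script above; it reuses a, size, ndepth and batchsize from it and does the split serially, with no threads involved), the integer version agrees exactly:
W_int = np.arange(size*ndepth).reshape(size, ndepth)
full_int = np.dot(a, W_int)
split_int = np.dot(a[:batchsize], W_int[:batchsize, :]) + np.dot(a[batchsize:], W_int[batchsize:, :])
print(np.array_equal(full_int, split_int))   # True: integer dot products of these magnitudes are exact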
Answer:
I will refer you to the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic. In particular for your issue, floating-point arithmetic is neither associative nor distributive: a + (b + c) does not necessarily equal (a + b) + c, nor does a*(b + c) necessarily equal a*b + a*c.
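You can reproduce this without any concurrency at all. The following sketch is my illustration (reusing the shapes from the question, with the 2-row split done serially); the exact values vary from run to run, but the comparison typically shows the same last-bit differences:
import numpy as np

a = np.random.randint(0, 255, 4)
W = np.random.standard_normal((4, 4))

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: addition is not associative

full = np.dot(a, W)                                          # all four rows accumulated in one pass
split = np.dot(a[:2], W[:2, :]) + np.dot(a[2:], W[2:, :])    # same math, different summation order

print(full == split)    # often a mix of True and False
print(full - split)     # differences on the order of 1e-14, if any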
To make matters more complicated, some processors also have an FMA (fused multiply-add) instruction that numpy might use, which computes a*b + c in a single instruction with only one rounding, giving a potentially different (more accurate, but different) result than first computing a*b and then adding c.
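If you want to see the fused-versus-unfused difference directly, Python 3.13 adds math.fma; this sketch assumes Python 3.13 or newer and is only an illustration, not something the original code or numpy exposes:
import math

x, y, z = 1e16, 1.0000000000000002, -1e16

fused = math.fma(x, y, z)   # x*y + z with a single rounding at the end (Python 3.13+)
unfused = x * y + z         # x*y rounded to a double first, then z added

print(fused, unfused)       # e.g. roughly 2.220446049250313 vs 2.0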
Ultimately you must always treat floating-point arithmetic as an approximation, unless you have very carefully analyzed your arithmetic and input values and found that they do not lose any precision (which virtually never happens for general arithmetic).
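In practice, the fix for the question's aa == cc check is to compare with a tolerance rather than with exact equality; one common approach (my suggestion, not part of the original answer) is numpy's allclose/isclose:
print(np.allclose(aa, cc))   # single True/False verdict using relative/absolute tolerances
print(np.isclose(aa, cc))    # per-element booleans with the same tolerances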