I have been working on a multi-GPU project where I have had problems with obtaining non-deterministic results. I was surprised when it turned out that I obtained non-deterministic results due to a reduction clause executed on the CPU.
In the book Using OpenMP - The Next Step it is written that
"[...] the order in which threads combine their value to construct the value for the shared result is non-deterministic."
Maybe I just don't understand how the reduction clauses are implemented. Does it mean that if I use schedule(monotonic:static)
in combination with a reduction clause each thread will execute its chunk of the iterations in a deterministic order, but that the order in which the partial results are combined at the end of the parallel region is non-deterministic?
CodePudding user response:
Does it mean that if I use schedule(monotonic:static) in combination with a reduction clause each thread will execute its chunk of the iterations in a deterministic order, but that the order in which the partial results are combined at the end of the parallel region is non-deterministic?
It is known that the end result is non-determinist, detailed information can be found in:
What Every Computer Scientist Should Know about Floating Point Arithmetic. For instance:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x y) z has a totally different answer than x (y z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
Now regarding the order in which the threads perform the reduction action, as far as I know, the OpenMP standard does not enforce any order, or requires that the order has to be deterministic. Hence, this is an implementation detail that is left up to the compiler that is implementing the OpenMP standard to decide, and consequently, it is something that your code should not reply upon.
CodePudding user response:
Programming language semantics usually declares that a b c d
is evaluated as a b) c) d
(fill in opening parens yourself). This is not parallel, so an OMP reduction is probably evaluated as (a b) (c d)
. And so on for larger numbers of summands.
So you immediately have that, because of the non-associativity of floating point arithmetic, the result may be subtly different from the sequential value.
But more importantly, the exact value will depend on precisely how the combination is done. Is it a (b c)
(on 2 threads) or (a b) c
? So the result is at least "indeterministic" in the sense that you can not reconstruct how it was formed. It could probably even be done in two different ways, if you run the same reduction twice. That's what I would call "non-deterministic", but look in the standard for the exact definition of the term.
Btw, if you want to get some idea of how OMP actually does it, write your own reduction operator, and let each invocation print out what it computes. Here is a decent illustration: https://victoreijkhout.github.io/pcse/omp-reduction.html#Initialvalueforreductions
Btw, the standard actually doesn't use the word "non-deterministic" for this case. The following passage explains the issue:
Furthermore, using different numbers of threads may result in different numeric results because of changes in the association of numeric operations. For example, a serial addition reduction may have a different pattern of addition associations than a parallel reduction.