I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix cuba cubn*(p.jy cuba cubn*(p.kz cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; i)
{
int ic=GetIndex(particles[i]);
#pragma omp atomic update
Length[ic] ;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; i) Begin[i]=Begin[i-1] Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; i)
{
if(particles[i].ip==3)
{
int ic=GetIndex(particles[i]);
if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
int cnt=0;
#pragma omp atomic capture
cnt=Counter[ic] ;
MIndex[Begin[ic] cnt]=i;
}
}
If to remove
#pragma omp parallel for
the code works properly and the output results are always the same. But with this pragma there is some undefined behaviour/race condition in the code, because each time it gives different output results. How to fix this issue?
Update: The task is the following. Have lots of particles with some random coordinates. Need to output to the array MIndex the indices in the array particles of the particles, which are in each cell (cartesian cube, for example, 1×1×1 cm) of the coordinate system. So, in the beginning of MIndex there should be the indices in the array particles of the particles in the 1st cell of the coordinate system, then - in the 2nd, then - in the 3rd and so on. The order of indices within given cell in the area MIndex is not important, may be arbitrary. If it is possible, need to make this in parallel, may be using atomic operations.
There is a straight way: to traverse across all the coordinate cells in parallel and in each cell check the coordinates of all the particles. But for large number of cells and particles this seems to be slow. Is there a faster approach? Is it possible to travel across the particles array only once in parallel and fill MIndex array using atomic operations, something like written in the code piece above?
CodePudding user response:
You are right to make the update Counter[ic]
atomic, but there is an additional problem on the next line: MIndex[Begin[ic] cnt]=i;
Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex
. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speed up if probably going to be abysmal.
EDIT the second line however is not of the right form for an atomic operation, so you have to make it critical
. Which is going to make performance even worse.
Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this can not be parallelized.
CodePudding user response:
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3)
in the source before. I was assuming that Count[ic]
will go as high as Length[ic]
in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]
), with each thread working on a known range of i
values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic]
is the number of particles in that cube, out of the range of i
values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
Pre-process that into a per-thread Begin[][]
array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for(ic < cub3) {
// perhaps do this "vertical" sum into a temporary array
// or prefix-sum within Length before combining across threads?
int pos = sum(Length[0..nthreads-1][ic-1]) Begin[0][ic-1];
Begin[0][ic] = pos;
for (int t = 1 ; t<nthreads ; t ) {
pos = Length[t][ic]; // prefix-sum across threads for this cube bucket
Begin[t][ic] = pos;
}
}
This has a pretty terrible cache access pattern, especially with cuba=8
making Length[t][0]
and Length[t 1][0]
4096 bytes apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex
, each thread can do int pos = --Length[t][ic]
to derive a unique ID from the Length. (Like you were doing with Count[]
, but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic]
positions, MIndex[...] = i
stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
Don't overdo it with number of threads, especially if cuba
is greater than 8. The amount of Length / Begin pre-processing work scales with number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8
meaning each per-thread array is only 4096 bytes (too small to parallelize the zeroing of, BTW), it's really not that much.
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter
, Begin
and MIndex
elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]
can be doing the load while there are cache misses outstanding for Begin[ic]
and/or for Mindex[]
stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter
accesses are scattered enough they cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem,
Perhaps software prefetch can help, like prefetchw
(write-prefetching) the counter for 5 or 10 i
iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst
increments, though. Whichever thread increments Counter[ic]
first is the one that associates that cnt
with that i
.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[]
elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz
ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[]
.
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
Re: your edit
You already loop over the whole array of points doing Length[ic] ;
? Seems redundant to do the same histogramming work again with Counter[ic] ;
, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0;
is too small to be worth parallelizing. For cuba=8
, it's 8*8*16 * sizeof(int)
= 4096 bytes, just one page, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd
or just normal auto-vectorization with gcc -O3 -march=native
since it's integer work.
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i]
, and another counting up from 0 in Counter[i]
. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top