Basic arrays copy OpenCL GPU-CodePudding

Can someone help me to figure out of to traslate this C code for CPU, to kernel code for GPU

int a[N], b[N];
b[0] = a[0];
b[N] = a[N];

for (i=1; i<N-1; i  ) 
    b[i]= a[i-1]   a[i]   a[i 1];

I thought about writing it this way, but I would like to find a better performing solution

__kernel void adjacentCopy(__global double *a, __global double *b, const unsigned int n) {

    int gid = get_global_id(0);

    if (gid < N) 
        b[gid] = a[gid-1] a[gid] a[gid 1];
    
}
    // and than execute the two base case into the host

Anyone can suggest a way to organize the code to use local memory, and also bringing back the two extremes cases into the kernel, without adding branches divergence

CodePudding user response：

A kernel in essence is a for-loop, of which every iteration runs in parallel. The exact order of execution is random, so there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other).

In your case, the kernel would read:

__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    if(gid==0||gid==N-1) return; // guard clause: do not execute the first and last element
    b[gid] = a[gid-1] a[gid] a[gid 1]; // double buffers to resolve data dependencies: only read from a and only write to b
}

With the extreme cases gid==0||gid==N-1, on such a computational grid you typically use periodic boundary conditions. Then the kernel would become branchless and look like this:

__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    b[gid] = a[(gid N-1)%N] a[gid] a[(gid 1)%N]; // periodic boundaries with modulo; in "(gid N-1)" the " N" ensures that the argument of the modulo operator always is positive
}

Now for the local memory optimization: Without it, for every thread, you read 3 neighboring values of a from slow global memory. In theory, you could only load one element per thread from global memory and use fast local memory to share the data within the workgroup. But the two threads at gid==0||gid==N-1 will have to load 2 values from global memory, introducing branching, and this will likely kill any potential gain in performance. The added complication, together with no significant gains in performance, make the local memory optimization an unfavourable choice in this case. This is how the kernel would look like:

#define def_workgroup_size 128 // set this to the size of the workgroup
__kernel void adjacentCopy(const __global double *a, __global double *b, const unsigned int N) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    __local double cached_a[def_workgroup_size 2]; // as large as the workgroup, plus neighbors on the left and right sides of the workgroup
    if(lid==0) cached_a[lid] = a[(gid N-1)%N]; // first thread in workgroup also has to load left neighbor
    cached_a[lid 1] = a[gid];
    if(lid==def_workgroup_size-1) cached_a[lid 1] = a[(gid 1)%N]; // last thread in workgroup also has to load right neighbor
    barrier(CLK_LOCAL_MEM_FENCE); // barrier to make sure cached_a is entirely filled up
    b[gid] = cached_a[lid] cached_a[lid 1] cached_a[lid 2]; // read 3 values from local memory
}