Dependencies with OpenMP on a simple code-CodePudding

I haven't been using OpenMP for a long time and I have trouble optimizing this code:

#define SIZE 100000000

typedef struct {
  float a,b,c,d,e,f,g,h;
} s_t;
  
void compute(s_t *a) {
  int i;
  for (i=4; i<SIZE; i  ) {
    a[i].a=(a[i-1].b * 0.42   a[i-3].d * .32);
    a[i].b = a[i].c * 3.56 - a[i-3].f;
    a[i].c = a[i].d   a[i].g*a[i-3].d;
    a[i].d = .3334/sqrt(a[i].f*a[i].f   a[i].c*a[i].c);
    if (a[i].f   a[i].a>1e-3) 
      a[i].f = (a[i].f - a[i].a)*(a[i].f   a[i].a);

  }
}

int main() {
  int i;
  s_t *a;
  a=(s_t *)malloc(sizeof(s_t)*SIZE);
  /* Initialization */
  for (i=0; i<SIZE; i  ) 
    a[i].a=a[i].b=a[i].c=a[i].d=a[i].e=a[i].f=a[i].g=a[i].h=1./(i 1);
  /* Computation */
  for(i=0;i<100;i  ) {
    compute(a);
    fprintf(stderr,".");
  }
  fprintf(stderr,"%f ",a[10].a);
  free(a);
  return 0;
}

I would like to use a "#pragma omp parallel for" on the loop in the compute function but there are several dependencies.

I tried with the depend clause but I think that having a[i] depends of a[i-1] and a[i-3] will just make the code sequential. I don't really know how to handle this problem with OpenMP. Could you give me some ideas and or guidance on how to do it ?

I added the main so that you guys can see how the compute function is called. If you have any other ideas on how to optimize the code with OpenMP or any other method please let me know.

CodePudding user response：

The optimization of this code turns out to be very difficult because of the dependencies. There is no really efficient way to parallelize this code using multiple threads on modern mainstream processors. That being said, some operations are independent and are the same on multiple data. Thus, we can benefit from instruction-level parallelism and SIMD instructions. Indeed, a modern processor core can execute multiples instructions simultaneously during a given cycle and this instruction can operate on several values at the same time.

Additionally, we can also benefit from the fast reciprocal square root provided by mainstream x86-64 processors. In fact, this is what makes the -ffast-math flag faster as well as using FMA instructions so to reduce the latency of the critical path instructions. That being said, this instruction (as well as using -ffast-math causes the result to be possibly different from one machine to another and also possibly less precise. On top of that, double-precision constants can be replaced by floating-point ones as proposed by @Laci (not to mention the use of sqrtf, though there is no double-precision reciprocal square root instructions on x86-64 processor anyway).

Implementation

To be able to use SIMD instructions, one need to track the dependencies and find multiple same instruction working on different values in the C code. The first step is to unroll the loop 3 times due to the i-3 dependency. Then, the second step consists in gathering the same operations while taking care of dependencies. The results in an ugly big code like this:

void compute_unroll(s_t* a)
{
    int i;

    for (i=4; i<SIZE-2; i =3)
    {
        float dm3 = a[i-3].d;
        float dm2 = a[i-2].d;
        float dm1 = a[i-1].d;

        float fm3 = a[i-3].f;
        float fm2 = a[i-2].f;
        float fm1 = a[i-1].f;

        float bm1 = a[i-1].b;

        float a0 = a[i].a;
        float b0 = a[i].b;
        float c0 = a[i].c;
        float d0 = a[i].d;
        float e0 = a[i].e;
        float f0 = a[i].f;
        float g0 = a[i].g;

        float a1 = a[i 1].a;
        float b1 = a[i 1].b;
        float c1 = a[i 1].c;
        float d1 = a[i 1].d;
        float e1 = a[i 1].e;
        float f1 = a[i 1].f;
        float g1 = a[i 1].g;

        float a2 = a[i 2].a;
        float b2 = a[i 2].b;
        float c2 = a[i 2].c;
        float d2 = a[i 2].d;
        float e2 = a[i 2].e;
        float f2 = a[i 2].f;
        float g2 = a[i 2].g;

        b0 = c0 * 3.56f - fm3;
        b1 = c1 * 3.56f - fm2;
        b2 = c2 * 3.56f - fm1;

        a0 = bm1 * 0.42f   dm3 * 0.32f;
        a1 = b0 * 0.42f   dm2 * 0.32f;
        a2 = b1 * 0.42f   dm1 * 0.32f;

        c0 = d0   g0*dm3;
        c1 = d1   g1*dm2;
        c2 = d2   g2*dm1;

        d0 = 0.3334f / sqrtf(f0*f0   c0*c0);
        d1 = 0.3334f / sqrtf(f1*f1   c1*c1);
        d2 = 0.3334f / sqrtf(f2*f2   c2*c2);

        if (f0   a0 > 1e-3f) f0 = (f0 - a0) * (f0   a0);
        if (f1   a1 > 1e-3f) f1 = (f1 - a1) * (f1   a1);
        if (f2   a2 > 1e-3f) f2 = (f2 - a2) * (f2   a2);

        a[i].a = a0;
        a[i].b = b0;
        a[i].c = c0;
        a[i].d = d0;
        a[i].f = f0;

        a[i 1].a = a1;
        a[i 1].b = b1;
        a[i 1].c = c1;
        a[i 1].d = d1;
        a[i 1].f = f1;

        a[i 2].a = a2;
        a[i 2].b = b2;
        a[i 2].c = c2;
        a[i 2].d = d2;
        a[i 2].f = f2;
    }

    for (; i<SIZE;   i)
    {
        a[i].a = (a[i-1].b * 0.42f   a[i-3].d * 0.32f);
        a[i].b = a[i].c * 3.56f - a[i-3].f;
        a[i].c = a[i].d   a[i].g*a[i-3].d;
        a[i].d = 0.3334f / sqrtf(a[i].f*a[i].f   a[i].c*a[i].c);

        if (a[i].f   a[i].a > 1e-3f)
            a[i].f = (a[i].f - a[i].a) * (a[i].f   a[i].a);
    }
}

Now, we can see that some parts can easily benefit from using SIMD instructions like this one for example:

d0 = 0.3334f / sqrtf(f0*f0   c0*c0);
d1 = 0.3334f / sqrtf(f1*f1   c1*c1);
d2 = 0.3334f / sqrtf(f2*f2   c2*c2);

Some other parts require data from multiple location in memory. gathering data from non-contiguous location is generally not efficient. The problem is that the input layout is not efficient in the first place. It would be more efficient to store data by storing it like a0 a1 a2 b0 b1 b2 c0 c1 c2 ... rather than a0 b0 c0 d0 e0 f0 g0 a1 b1 c1 .... Indeed, this alternative layout enables the processor to load/store data block by block (SIMD load/store) rather than 1 value at a time (scalar load). That being said, it may not be possible to change the data layout regarding the context in which this code is used. Because of that, the following part of this answer will not consider this optimization.

Now we can vectorize the code using SIMD instructions. There are many ways to do this. There are for example some relatively high-level libraries to do that. OpenMP also help to do that. However, the performance of these solutions tend to be quite disappointing for a pathological unusual case like this one. Consequently, I choose to use low-level x86-64 SSE intrinsics. More specifically, I considered the SSE4.1 instruction set (supported by >99% of PCs) and the FMA instruction set (also widely supported and fmadd/fmsub instructions can easily be replaced by fmul fadd/fsub instructions anyway if needed).

void compute_sse(s_t* a)
{
    int i = 4;

    if (i<SIZE-2)
    {
        __m128 vdm = _mm_setr_ps(a[i-3].d, a[i-2].d, a[i-1].d, 0.0f);
        __m128 vfm = _mm_setr_ps(a[i-3].f, a[i-2].f, a[i-1].f, 0.0f);
        float bm1 = a[i-1].b;

        for (; i<SIZE-2; i =3)
        {
            float a0 = a[i].a, a1 = a[i 1].a, a2 = a[i 2].a;
            float b0 = a[i].b, b1 = a[i 1].b, b2 = a[i 2].b;
            float c0 = a[i].c, c1 = a[i 1].c, c2 = a[i 2].c;
            float d0 = a[i].d, d1 = a[i 1].d, d2 = a[i 2].d;
            float e0 = a[i].e, e1 = a[i 1].e, e2 = a[i 2].e;
            float f0 = a[i].f, f1 = a[i 1].f, f2 = a[i 2].f;
            float g0 = a[i].g, g1 = a[i 1].g, g2 = a[i 2].g;

            // vb[j] = vc[j] * 3.56f - vfm[j]
            __m128 vc = _mm_setr_ps(c0, c1, c2, 0.0f);
            __m128 vb = _mm_fmsub_ps(vc, _mm_set1_ps(3.56f), vfm);

            _MM_EXTRACT_FLOAT(b0, vb, 0);
            _MM_EXTRACT_FLOAT(b1, vb, 1);
            _MM_EXTRACT_FLOAT(b2, vb, 2);

            a[i].b = b0;
            a[i 1].b = b1;
            a[i 2].b = b2;

            // va[j] = vb_prev[j] * 0.42f   vdm[j] * 0.32f
            __m128 vb_prev = _mm_setr_ps(bm1, b0, b1, 0.0f);
            __m128 va = _mm_fmadd_ps(vb_prev, _mm_set1_ps(0.42f), _mm_mul_ps(vdm, _mm_set1_ps(0.32f)));

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].a, va, 0);
            _MM_EXTRACT_FLOAT(a[i 1].a, va, 1);
            _MM_EXTRACT_FLOAT(a[i 2].a, va, 2);

            // vc[j] = vg[j] * vdm[j]   vd[j]
            __m128 vd = _mm_setr_ps(d0, d1, d2, 0.0f);
            __m128 vg = _mm_setr_ps(g0, g1, g2, 0.0f);
            vc = _mm_fmadd_ps(vg, vdm, vd);

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].c, vc, 0);
            _MM_EXTRACT_FLOAT(a[i 1].c, vc, 1);
            _MM_EXTRACT_FLOAT(a[i 2].c, vc, 2);

            // d[j] = 0.3334f / sqrtf(vf[j]*vf[j]   vc[j]*vc[j])
            __m128 vf = _mm_setr_ps(f0, f1, f2, 0.0f);
            __m128 vf2 = _mm_mul_ps(vf, vf);
            __m128 vc2 = _mm_mul_ps(vc, vc);
            __m128 vsum = _mm_add_ps(vf2, vc2);
            vd = _mm_mul_ps(_mm_set1_ps(0.3334f), _mm_rsqrt_ps(vsum));

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].d, vd, 0);
            _MM_EXTRACT_FLOAT(a[i 1].d, vd, 1);
            _MM_EXTRACT_FLOAT(a[i 2].d, vd, 2);

            // if(f[j]   a[j] > 1e-3f) f[j] = (f[j] - a[j]) * (f[j]   a[j]);
            __m128 vfpa = _mm_add_ps(vf, va);
            __m128 vcond = _mm_cmpgt_ps(vfpa, _mm_set1_ps(1e-3f));
            __m128 vfma = _mm_sub_ps(vf, va);
            vf = _mm_blendv_ps(vf, _mm_mul_ps(vfma, vfpa), vcond);

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].f, vf, 0);
            _MM_EXTRACT_FLOAT(a[i 1].f, vf, 1);
            _MM_EXTRACT_FLOAT(a[i 2].f, vf, 2);

            // Useful for the next iteration not to reload values from memory
            vdm = vd;
            vfm = vf;
            bm1 = b2;
        }
    }

    // Remaining part
    for (; i<SIZE;   i)
    {
        a[i].a = (a[i-1].b * 0.42f   a[i-3].d * 0.32f);
        a[i].b = a[i].c * 3.56f - a[i-3].f;
        a[i].c = a[i].d   a[i].g*a[i-3].d;
        a[i].d = 0.3334f / sqrtf(a[i].f*a[i].f   a[i].c*a[i].c);

        if (a[i].f   a[i].a > 1e-3f)
            a[i].f = (a[i].f - a[i].a) * (a[i].f   a[i].a);
    }
}

Note that this code try to keep data as much as possible in registers (fast) rather than reloading them from memory (slow). Still, on my machine, this code takes a significant portion of its time reading/writing data from/to memory. This is mainly due to the inefficient memory layout, but also because it is hard for one core to fully saturate the memory.

Results

Here are experimental results on my i5-9600KF processor using GCC 10.3 with the flags -O3 -march=native -ffast-math:

compute (no -ffast-math):  0.444 ms/it
compute:                   0.404 ms/it
compute_laci:              0.318 ms/it
compute_unroll:            0.317 ms/it
compute_sse:               0.254 ms/it
seq optimal lower-bound:   0.190 ms/it  (saturation of the RAM by 1 core)
par optimal lower-bound:   0.170 ms/it  (saturation of the RAM)

compute_sse is mostly memory-bound on my machine as it achieves to reach a good throughput of 23.5 GiB/s, while the maximum is generally 31-32 GiB/s (for read/writes) using 1 core, and never more than 34-36 GiB/s (for read/writes) using multiple cores in practice. A better memory-layout should be enough to get an execution time very close to the optimal execution time using 1 core.

Note that server processors like Intel Xeon tends to clearly not saturate the RAM bandwidth due to the way the architecture is designed (they are meant to run parallel codes which scale). In fact, the practical RAM throughput a core can achieve is generally significantly smaller than the one on a mainstream PC. Thus, this can can be less efficient on such server processor. See this answer for more information.