"Intrinsics" possible on GPU on OpenGL?


I had this idea for something "intrinsic-like" in OpenGL, but googling around brought up no results.

So basically I have a compute shader for calculating the Mandelbrot set (each thread does one pixel). Part of my main function in GLSL looks like this:

float XR, XI, XR2, XI2, CR, CI;
uint i;
// Map this invocation's pixel to its point c = CR + CI*i in the complex plane.
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0.0;
XI = 0.0;
for (i = 0; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;  // imaginary part of z = z^2 + c
    XR = XR2 - XI2 + CR;      // real part of z = z^2 + c
    if ((XR * XR + XI * XI) > 4.0)  // escape test: |z| > 2
    {
        break;
    }
}

So my thought was to use vec4s instead of floats and do 4 calculations/pixels at once, hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code runs MUCH slower than the float version. There are still some mistakes in it (if anyone would like to see the full code, please say so), but I don't think they are what slows it down. Before I spend ages trying things out, can anybody tell me right away whether this endeavour is futile?
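
For reference, here is a rough sketch of the kind of vec4 variant I mean (not my exact code; the pixel-to-lane mapping and masking details are just placeholders): each invocation does four neighbouring pixels, and a per-lane "alive" mask is needed because the four points escape at different iteration counts.

// Sketch of the vec4 idea: four pixels per invocation.
vec4 XR = vec4(0.0), XI = vec4(0.0);
vec4 CR = minX + (4.0 * float(gl_GlobalInvocationID.x) + vec4(0.0, 1.0, 2.0, 3.0))
          * (maxX - minX) / ResX;
vec4 CI = vec4(minY + float(gl_GlobalInvocationID.y) * (maxY - minY) / ResY);
vec4 iter = vec4(0.0);  // per-lane escape iteration counts, for colouring
for (uint i = 0u; i < MaxIter; i++)
{
    vec4 XR2 = XR * XR;
    vec4 XI2 = XI * XI;
    vec4 alive = vec4(lessThanEqual(XR2 + XI2, vec4(4.0)));  // 1.0 = still iterating
    if (alive == vec4(0.0)) break;                           // all four lanes escaped
    iter += alive;                                           // count only live lanes
    XI = mix(XI, 2.0 * XR * XI + CI, alive);                 // update live lanes only
    XR = mix(XR, XR2 - XI2 + CR, alive);
}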

CodePudding user response:

CPUs and GPUs work quite differently.

CPUs need explicit vectorization in the machine code, either coded by the programmer (through what you call 'CPU intrinsics') or vectorized automatically by the compiler.

GPUs, on the other hand, vectorize by running multiple invocations of your shader (aka kernel) across their cores.
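
As a minimal sketch (the image binding, workgroup size, and the placeholder per-pixel math are assumptions, not taken from the question), this is the shape of a compute shader where the hardware, not the programmer, supplies the vectorization:

layout(local_size_x = 16, local_size_y = 16) in;
layout(rgba8, binding = 0) uniform writeonly image2D destTex;

void main()
{
    // One invocation = one pixel = one SIMD lane; the hardware runs
    // 32/64 such invocations in lockstep, which is where the
    // "vectorization" actually happens.
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    float shade = float((pixel.x ^ pixel.y) & 255) / 255.0;  // placeholder per-pixel work
    imageStore(destTex, pixel, vec4(vec3(shade), 1.0));
}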

AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add 4 floats at a time (for example), it's more beneficial to have four times as many simpler cores, each operating on a single float at a time. Either way you get the same throughput for code that works with vectors. For code that operates on scalars, however, the wide core's extra vector circuitry would sit idle and be wasted; split across multiple simple cores, the same silicon executes multiple instances of your shader instead. And most code, by necessity, has at least some scalar computations in it.
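
To illustrate (to my knowledge this holds on current AMD and NVIDIA architectures, whose ISAs are scalar per lane): the compiler lowers vector math to per-component scalar instructions anyway, so these two hypothetical functions cost the same:

vec4 madVec(vec4 a, vec4 b, vec4 c)
{
    return a * b + c;              // compiled to 4 scalar FMAs per invocation
}

vec4 madScalar(vec4 a, vec4 b, vec4 c)
{
    return vec4(a.x * b.x + c.x,   // the same 4 scalar FMAs, written out
                a.y * b.y + c.y,
                a.z * b.z + c.z,
                a.w * b.w + c.w);
}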

The bottom line is that your scalar code is already likely to utilize the GPU's resources to their maximum.
