Optimizing raster algorithm in OpenCL: 32 seconds for a cube on nVidia RTX 3080


I'm new to OpenCL. I wrote an OpenCL software rasterizer that rasterizes triangles. Right now rendering a cube takes 32 seconds, which is far too slow; I'm testing on an nVidia RTX 3080 Laptop GPU. The result is very weird.

Here is the kernel:

    __kernel void fragment_shader(__global struct Fragment* fragments, __global struct Triangle_* triangles, int triCount)
    {
        size_t px = get_global_id(0); // one work-item per triangle
        //size_t py = get_global_id(1);
        int imageWidth = 256;
        int imageHeight = 256;
        if (px < triCount)
        {
            float3 v0Raster = (float3)(triangles[px].v[0].pos[0], triangles[px].v[0].pos[1], triangles[px].v[0].pos[2]);
            float3 v1Raster = (float3)(triangles[px].v[1].pos[0], triangles[px].v[1].pos[1], triangles[px].v[1].pos[2]);
            float3 v2Raster = (float3)(triangles[px].v[2].pos[0], triangles[px].v[2].pos[1], triangles[px].v[2].pos[2]);

            // bounding box of the triangle in raster space
            float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
            float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
            float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
            float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);
            float slope = (ymax - ymin) / (xmax - xmin);
            // be careful: xmin/xmax/ymin/ymax can be negative. Don't cast to uint32_t
            unsigned int x0 = max((uint)0, (uint)(floor(xmin)));
            unsigned int x1 = min((uint)(imageWidth) - 1, (uint)(floor(xmax)));
            unsigned int y0 = max((uint)0, (uint)(floor(ymin)));
            unsigned int y1 = min((uint)(imageHeight) - 1, (uint)(floor(ymax)));

            float3 v0 = v0Raster;
            float3 v1 = v1Raster;
            float3 v2 = v2Raster;
            float area = edgeFunction(v0Raster, v1Raster, v2Raster);

            // test every pixel of the bounding box against the three edge functions
            for (unsigned int y = y0; y <= y1; ++y) {
                for (unsigned int x = x0; x <= x1; ++x) {
                    float3 p = { x + 0.5f, y + 0.5f, 0 };
                    float w0 = edgeFunction(v1Raster, v2Raster, p);
                    float w1 = edgeFunction(v2Raster, v0Raster, p);
                    float w2 = edgeFunction(v0Raster, v1Raster, p);
                    if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
                        fragments[y * 256 + x].col[0] = 1.0f;
                        fragments[y * 256 + x].col[1] = 0;
                        fragments[y * 256 + x].col[2] = 0;
                    }
                }
            }
        }
    }

The kernel is supposed to run once per triangle; it does bounding-box testing and rasterizes the pixels. Here is how I invoke it:

    global_size[0] = triCount - 1;
    auto time_start = std::chrono::high_resolution_clock::now();

    err = clEnqueueNDRangeKernel(commandQueue, kernel_fragmentShader, 1, NULL, global_size,
                                 NULL, 0, NULL, NULL);
    if (err < 0) {
        perror("Couldn't enqueue the kernel_fragmentShader");
        exit(1);
    }

I tried omitting lighting and everything else; it still takes around 20 seconds to render a cube.

CodePudding user response:

This kind of approach is well suited to massively parallel rendering, like on a GPU. I assume you are effectively doing this on the CPU side, so the performance is poor: you have no (or too little) parallelization and no (or very little) HW support for the operations used. On a GPU you get SIMD instructions for most of the needed operations, and a lot of the work is done in HW instead of in code.
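For illustration only (I do not code in OpenCL, so take this as an untested, assumption-laden sketch): one way to get more parallelism is to launch one work-item per pixel instead of one per triangle, and loop over the (small) triangle list inside each work-item. This reuses the question's Fragment/Triangle_ structs and edgeFunction, hard-codes the 256x256 target, and would be enqueued with a 2-D global size of {256, 256}:

    // hypothetical pixel-parallel variant: one work-item per pixel,
    // each work-item tests its pixel against every triangle
    __kernel void fragment_shader_per_pixel(__global struct Fragment* fragments,
                                            __global struct Triangle_* triangles,
                                            int triCount)
    {
        int x = get_global_id(0);              // 0 .. 255
        int y = get_global_id(1);              // 0 .. 255
        if (x >= 256 || y >= 256) return;

        float3 p = (float3)(x + 0.5f, y + 0.5f, 0.0f);
        for (int t = 0; t < triCount; ++t) {
            float3 v0 = (float3)(triangles[t].v[0].pos[0], triangles[t].v[0].pos[1], triangles[t].v[0].pos[2]);
            float3 v1 = (float3)(triangles[t].v[1].pos[0], triangles[t].v[1].pos[1], triangles[t].v[1].pos[2]);
            float3 v2 = (float3)(triangles[t].v[2].pos[0], triangles[t].v[2].pos[1], triangles[t].v[2].pos[2]);
            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
                int idx = y * 256 + x;          // fragment index computed once
                fragments[idx].col[0] = 1.0f;
                fragments[idx].col[1] = 0.0f;
                fragments[idx].col[2] = 0.0f;
            }
        }
    }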

To gain speed on the CPU side, see how to rasterize rotated rectangle; this was the standard way of SW rendering back in the days before GPUs. The method simply renders the edges of a convex polygon (or triangle) as lines into 2 buffers (start/end points per horizontal line) and then just fills or interpolates the horizontal lines. This uses far fewer operations per pixel.
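A rough sketch of that idea in plain C (untested; the names spanMin/spanMax/addEdge/fillTriangle and the 256x256 coverage buffer are just placeholders, not anything from the question):

    #define W 256
    #define H 256

    static int spanMin[H], spanMax[H];

    /* rasterize one edge into the per-scanline min/max x buffers */
    static void addEdge(float x0, float y0, float x1, float y1)
    {
        if (y0 > y1) { float t = x0; x0 = x1; x1 = t; t = y0; y0 = y1; y1 = t; }
        int ya = (int)y0, yb = (int)y1;
        if (ya == yb) return;                      /* horizontal edge adds nothing */
        float dx = (x1 - x0) / (y1 - y0);          /* x step per scanline */
        float x = x0;
        for (int y = ya; y < yb; ++y, x += dx) {
            if (y < 0 || y >= H) continue;
            int xi = (int)x;
            if (xi < spanMin[y]) spanMin[y] = xi;
            if (xi > spanMax[y]) spanMax[y] = xi;
        }
    }

    /* fill a triangle: edges into buffers, then one horizontal span per row */
    static void fillTriangle(const float v0[2], const float v1[2], const float v2[2], unsigned char *img)
    {
        for (int y = 0; y < H; ++y) { spanMin[y] = W; spanMax[y] = -1; }
        addEdge(v0[0], v0[1], v1[0], v1[1]);
        addEdge(v1[0], v1[1], v2[0], v2[1]);
        addEdge(v2[0], v2[1], v0[0], v0[1]);
        for (int y = 0; y < H; ++y)                /* cheap fill of each covered row */
            for (int x = spanMin[y] < 0 ? 0 : spanMin[y]; x <= spanMax[y] && x < W; ++x)
                img[y * W + x] = 255;              /* mark the pixel as covered */
    }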

Your method computes a point-inside-triangle test for every pixel of the BBOX, which means many more pixels are processed, and each pixel needs too many complicated operations, which kills performance.

On top of this, your code is not optimized. For example:

    fragments[y * 256 + x].col[0] = 1.0f;
    fragments[y * 256 + x].col[1] = 0;
    fragments[y * 256 + x].col[2] = 0;

Why are you computing y * 256 + x three times? I would also feel better with (y<<8) + x, but nowadays compilers should do that for you. You can also just add 256 to the starting address per row instead of multiplying...
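For illustration (untested, just reusing the variables from your kernel), the inner loops could hoist the address like this:

    __global struct Fragment *row = fragments + (size_t)y0 * 256;   /* start of first covered row */
    for (unsigned int y = y0; y <= y1; ++y, row += 256) {            /* +256 per row instead of multiplying */
        for (unsigned int x = x0; x <= x1; ++x) {
            float3 p = (float3)(x + 0.5f, y + 0.5f, 0.0f);
            float w0 = edgeFunction(v1Raster, v2Raster, p);
            float w1 = edgeFunction(v2Raster, v0Raster, p);
            float w2 = edgeFunction(v0Raster, v1Raster, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
                __global struct Fragment *frag = row + x;             /* index computed once per pixel */
                frag->col[0] = 1.0f;
                frag->col[1] = 0.0f;
                frag->col[2] = 0.0f;
            }
        }
    }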

I do not code in OpenCL (IIRC it's for computer vision and DIP, not for rendering), so I hope you have direct access to fragments[] and not something constrained by additional tests, which kills performance a lot (similar to putpixel, setpixel, pixel[][], etc. on most gfx APIs, which can kill performance by up to 10000x).
