Can I calculate speedup for OpenCL kernels with templates and std::index_sequence?

TL;DR: How do I implement a for loop that runs a timed function with std::index_sequence?

Okay, I'll admit that title is a little cryptic, but I was looking at this question: "is that possible to have a for loop in compile time with runtime or even compile?"

I may have gotten too excited about what I could do with std::index_sequence. I'll explain what my goal is. I want something like the following code:

for(int i = 1; i < 100000; ++i)
{
    auto start = time();
    runOpenCL<i>();
    std::cout << time() - start << std::endl;
}

to compile to this (with the timers for each one):

runOpenCL<1>();
runOpenCL<2>();
runOpenCL<3>();
...
runOpenCL<100000>();

Now, I thought this should just work, right? Compilers often unroll for loops at compile time (if that's the right phrase) in this way. However, I understand templates have safeguards against this kind of dodgy code (a template argument must be a compile-time constant, and a loop variable is not one). I saw that std::index_sequence could get around that, but I don't understand template code well enough to figure out what's going on. Before anyone says I could just make it a normal function parameter: yes, I could do that, if you look at the function itself:

    template<int threadcount>
    INLINE void runOpenCL()
    {
        constexpr int itemsPerThread = (MATRIX_HEIGHT + threadcount - 1) / threadcount;
    
        // executing the kernel
        clObjs.physicsKernel.setArg(2, threadcount);
        clObjs.physicsKernel.setArg(3, itemsPerThread);
    
        clObjs.queue.enqueueNDRangeKernel(clObjs.physicsKernel, cl::NullRange, cl::NDRange(threadcount), cl::NullRange);
        clObjs.queue.finish();
        
        // making sure OpenGL is finished with its vertex buffer
        glFinish();
        
        // acquiring the OpenGL object (vertex buffer) for OpenCL use
        const std::vector<cl::Memory> glObjs = { clObjs.glBuffer };
        clObjs.queue.enqueueAcquireGLObjects(&glObjs);
        
        // copying the OpenCL buffer to the BufferGL
        clObjs.queue.enqueueCopyBuffer(clObjs.outBuffer, clObjs.glBuffer, 0, 0, planets_size_points);
    
        // releasing the OpenGL object
        clObjs.queue.enqueueReleaseGLObjects(&glObjs);
    }

but I don't want to. Do I need a better reason? I think it would be really cool to implement this, provided it is still readable in the end.

Answer:

Here is a possible version that unrolls the loop using a C++17 fold expression:

#include <type_traits>
#include <utility>

template <std::size_t I>
void runOpenCL();

template <std::size_t... Is>
void runAllImpl(std::index_sequence<Is... >) {
    // thanks @Franck for the better fold expression
    (runOpenCL<Is>(), ...);
}

void runAll() {
    runAllImpl(std::make_index_sequence<10000>{});
}
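
The question also asked for a timer around each call, which the fold expression above does not include. Here is a minimal sketch of one way to add it, assuming a stand-in `runOpenCL` (the busy loop is a placeholder, not the real kernel launch) and the hypothetical helper names `timeOne`, `timeAllImpl`, and `timeAll`. The indices are shifted by one because `std::make_index_sequence<N>` starts at 0, and `runOpenCL<0>` from the question would divide by zero when computing `itemsPerThread`:

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <utility>

// Placeholder for the question's runOpenCL: a busy loop stands in for the
// kernel launch so the sketch compiles without OpenCL.
template <std::size_t ThreadCount>
void runOpenCL() {
    static_assert(ThreadCount > 0, "thread count must be positive");
    volatile std::size_t sink = 0;
    for (std::size_t i = 0; i < 10000; ++i) sink = sink + i;
}

// Hypothetical helper: runs one instantiation, prints and returns the
// elapsed wall-clock time in milliseconds.
template <std::size_t ThreadCount>
double timeOne() {
    const auto start = std::chrono::steady_clock::now();
    runOpenCL<ThreadCount>();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    std::cout << ThreadCount << " thread(s): " << elapsed.count() << " ms\n";
    return elapsed.count();
}

template <std::size_t... Is>
std::array<double, sizeof...(Is)> timeAllImpl(std::index_sequence<Is...>) {
    std::array<double, sizeof...(Is)> ms{};
    // C++17 comma fold: ms[0] = timeOne<1>(), ms[1] = timeOne<2>(), ...
    ((ms[Is] = timeOne<Is + 1>()), ...);
    return ms;
}

// Times runOpenCL<1>() through runOpenCL<N>() in order.
template <std::size_t N>
std::array<double, N> timeAll() {
    return timeAllImpl(std::make_index_sequence<N>{});
}
```

Calling `timeAll<16>()` would print one line per thread count and return the timings. Every value is still a separate instantiation, so N probably has to stay modest.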

Without C++17 you can do something like this, but in a non-optimized build you will get a huge stack blow-up:

#include <type_traits>
#include <utility>

template <std::size_t I>
void runOpenCL();

template <std::size_t... Is>
void runAllImpl(std::index_sequence<Is... >) {
    int arr[]{ (runOpenCL<Is>(), 0)... };
    (void)arr;
}

void runAll() {
    runAllImpl(std::make_index_sequence<10000>{});
}

This seems to work with larger values than @康桓瑋's proposal, but GCC (at least) does not manage to compile it for 1000000 (10000 is "ok").
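
Since every index is a separate instantiation, one way to stay within the compiler's limits is to sample the thread counts instead of covering every value. Here is a rough sketch, assuming powers of two are an acceptable sampling (the question does not say so). It uses a placeholder `runOpenCL` that only records its argument, and `runPowersOfTwo` is a hypothetical name:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Records which thread counts were "run"; stands in for the real kernel launch.
static std::vector<std::size_t> g_calls;

template <std::size_t ThreadCount>
void runOpenCL() {
    g_calls.push_back(ThreadCount);
}

template <std::size_t... Is>
void runPowersOfTwoImpl(std::index_sequence<Is...>) {
    // Maps index k to thread count 2^k, so N indices reach 2^(N-1) threads.
    (runOpenCL<(std::size_t{1} << Is)>(), ...);
}

// Runs runOpenCL<1>, runOpenCL<2>, runOpenCL<4>, ... for N powers of two.
template <std::size_t N>
void runPowersOfTwo() {
    runPowersOfTwoImpl(std::make_index_sequence<N>{});
}
```

With this mapping, `runPowersOfTwo<17>()` would cover thread counts up to 65536 with only 17 instantiations.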

Answer:

You can generate a fixed-size function table at compile time and invoke the corresponding function in the table through a runtime index. For example:

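#include <array>
#include <cstddef>
#include <utility>

template <std::size_t I>
void runOpenCL();

// Requires C++20 (templated lambda). The unary + converts each captureless
// lambda to a plain function pointer so all array elements share one type.
template<std::size_t N>
constexpr auto gen_func_table = []<std::size_t... Is>
  (std::index_sequence<Is...>) {
  return std::array{ +[] { runOpenCL<Is>(); }...};
}(std::make_index_sequence<N>{});

int main() {
  constexpr std::size_t max_count = 100;
  constexpr auto& func_table = gen_func_table<max_count>;
  // index 0 is skipped, matching the question's loop that starts at 1
  for(std::size_t i = 1; i < max_count; ++i)
    func_table[i]();
}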
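
To connect the table back to the speedup in the title, here is a sketch that times each entry through a runtime index and divides the single-thread time by each result. It avoids the C++20 templated lambda by taking the address of each instantiation directly, so it should build as C++17. `runOpenCL` is a placeholder here, and `makeTable`, `kernelTable`, `timeCallMs`, and `measureSpeedups` are hypothetical names:

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Placeholder kernel launcher: remembers its thread count and burns some time.
static std::size_t g_lastThreadCount = 0;

template <std::size_t ThreadCount>
void runOpenCL() {
    g_lastThreadCount = ThreadCount;
    volatile std::size_t sink = 0;
    for (std::size_t i = 0; i < 10000; ++i) sink = sink + i;
}

using KernelFn = void (*)();

// Entry i of the table calls runOpenCL<i + 1>, so thread counts start at 1.
template <std::size_t... Is>
constexpr std::array<KernelFn, sizeof...(Is)> makeTable(std::index_sequence<Is...>) {
    return { &runOpenCL<Is + 1>... };
}

template <std::size_t N>
constexpr auto kernelTable = makeTable(std::make_index_sequence<N>{});

// Wall-clock milliseconds for one call through a table entry.
inline double timeCallMs(KernelFn fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// speedup[i] = time with 1 thread / time with (i + 1) threads.
template <std::size_t N>
std::vector<double> measureSpeedups() {
    static_assert(N > 0, "need at least the single-thread baseline");
    std::vector<double> ms;
    for (std::size_t i = 0; i < N; ++i)  // runtime index into the table
        ms.push_back(timeCallMs(kernelTable<N>[i]));

    std::vector<double> speedup;
    for (double t : ms)
        speedup.push_back(t > 0.0 ? ms.front() / t : 0.0);
    return speedup;
}
```

A single run per entry is noisy, so averaging several runs per thread count would be an obvious refinement before trusting the ratios.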
