So pretty much I'm trying to do some calculations that start off with roughly 10-20 objects, but doing the calculations on those objects creates 20-40 more, and so on and so forth; so it's slightly recursive, but not forever, since eventually the number of pending calculations reaches zero. I have considered using a different tool, but it's kind of too late for that for me. It's kind of an odd request, which is probably why no results came up.
In short, I'm wondering how to set the global work size to as many threads as are available. For example, if the GPU can run X processes in parallel, the global work size would be set to X.
edit: it would also work if I could launch more kernels from the GPU itself, but that doesn't look possible in OpenCL 1.2.
CodePudding user response:
There is not really a limit on the global work size (only above 2^32 threads do you have to switch to 64-bit ulong indexing to avoid integer overflow), and the resulting hard limit of 2^64 threads is so large that you can never possibly come even close to it.
If you need a billion threads, then set the global work size to a billion threads. The GPU scheduler and hardware will handle that just fine, even if the GPU only has a few thousand physical cores. In fact, you should always launch many more threads than there are cores on the GPU; otherwise the hardware won't be fully saturated and you lose performance. The only issue could be running out of GPU memory.
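As a minimal host-side sketch (assuming `queue` and `kernel` were created beforehand; names are placeholders), launching a billion work-items is a single call. The only 1.2-specific catch is that the global size must be a multiple of the local size:

```c
#include <CL/cl.h>

/* Sketch: enqueue ~10^9 work-items in one call. The scheduler maps the
   work-items onto the physical cores in batches automatically. */
cl_int launch_big(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t local  = 64;                                /* work-group size */
    size_t global = ((n + local - 1) / local) * local; /* OpenCL 1.2: the
        global size must be a multiple of the local size, so round up */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}
```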
Launching kernels from within kernels (device-side enqueue) is only possible in OpenCL 2.0-2.2, on AMD or Intel GPUs.
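For completeness, device-side enqueue in OpenCL C 2.0 looks roughly like this. This is a sketch with made-up kernel and buffer names, and it also requires the host to create an on-device default queue for `get_default_queue()` to return something usable:

```c
// OpenCL C 2.0 device-side enqueue (sketch).
kernel void pass(global float *objects, global int *count)
{
    size_t i = get_global_id(0);
    // ... process objects[i]; newly created objects bump *count atomically ...

    if (i == 0) {
        // Caveat: *count must be final before it is read here. A robust
        // version enqueues this tail as a separate single-work-item kernel
        // gated on the work pass via CLK_ENQUEUE_FLAGS_WAIT_KERNEL.
        int n = *count;
        if (n > 0)
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D((size_t)n),
                           ^{ /* next iteration's per-item work */ });
    }
}
```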
CodePudding user response:
It sounds like each iteration depends on the result of the previous one. In that case, your limiting factor is not the number of available threads. You cannot make some work-items "wait" for others submitted by the same kernel enqueueing API call (except to a limited extent within a work-group).
If you have an OpenCL 2.0 implementation at your disposal, you can queue subsequent iterations dynamically from within the kernel. If not, and you have established that your bottleneck is checking whether another iteration is required and the subsequent kernel submission, you could try the following:
Assuming a work-item can trivially determine how many threads are actually needed for an iteration based on the output of the previous iteration, you could speculatively enqueue multiple batches of the kernel, each of which depends on the completion event of the previous batch. Inside the kernel, you can exit early if the thread ID is greater than or equal to the number of threads required in that iteration.
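A sketch of both halves, with all names (`iteration`, `needed`, `max_iters`, ...) hypothetical. On the kernel side, over-provisioned work-items bail out immediately:

```c
// Kernel side (sketch): `needed` holds the item count for this iteration,
// produced by the previous iteration.
kernel void iteration(global const uint *needed, global float *objects)
{
    if (get_global_id(0) >= (size_t)*needed)
        return;                      // speculative thread, nothing to do
    // ... real per-object work; may atomically update the count for the
    //     next iteration ...
}
```

And on the host, a fixed number of speculative batches are chained through completion events:

```c
#include <CL/cl.h>

/* Host side (sketch): chain max_iters speculative batches, each one gated
   on the completion event of the previous batch. */
void enqueue_speculative(cl_command_queue queue, cl_kernel iteration_kernel,
                         size_t global, size_t local, int max_iters)
{
    cl_event prev = NULL;
    for (int i = 0; i < max_iters; ++i) {
        cl_event done;
        clEnqueueNDRangeKernel(queue, iteration_kernel, 1, NULL,
                               &global, &local,
                               prev ? 1 : 0, prev ? &prev : NULL, &done);
        if (prev) clReleaseEvent(prev);
        prev = done;
    }
    if (prev) clReleaseEvent(prev);
}
```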
This only works if you either have a hard upper bound or can make a reasonable guess that will yield sensible results (with acceptable perf characteristics if the guess is wrong) for:
- The maximum number of iterations.
- The number of work-items required on each iteration.
Submitting, say, UINT32_MAX work-items for each iteration will likely not make any sense in terms of performance, as the number of work-items that fail the check for whether they are needed will dominate.
You can work around underestimates of the latter number by surrounding the calculation with a loop, so that work-item N calculates both item N and item N+M if the number of items in an iteration exceeds M, where M is the enqueued work size for that iteration.
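As a sketch, that wrap-around is just a stride loop in the kernel (same hypothetical names as above):

```c
// Sketch: work-item N processes items N, N+M, N+2M, ... where M is the
// enqueued global size, so an underestimated batch size still produces
// correct results, just with less parallelism.
kernel void iteration(global const uint *needed, global float *objects)
{
    const size_t stride = get_global_size(0);              /* M */
    for (size_t i = get_global_id(0); i < (size_t)*needed; i += stride) {
        // ... process objects[i] ...
    }
}
```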
Incorrect guesses for the number of iterations would need to be detected on the host, and more iterations enqueued.
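On the host, that detection can be a blocking read of the remaining-work counter once the last speculative batch has completed (again, the buffer name is made up):

```c
#include <CL/cl.h>

/* Sketch: a blocking read of the remaining-work counter after the last
   speculative batch tells the host whether another round is needed. */
cl_uint items_remaining(cl_command_queue queue, cl_mem needed_buf)
{
    cl_uint remaining = 0;
    clEnqueueReadBuffer(queue, needed_buf, CL_TRUE /* blocking */, 0,
                        sizeof remaining, &remaining, 0, NULL, NULL);
    return remaining;
}
```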
So it becomes a case of performing a large number of runs with different guesses and gathering statistics on how good the guesses are and what overall performance they yielded.
I can't say whether this will yield acceptable performance in general - it really depends on whether the calculations you are performing are a good fit for GPU-style parallelism, and on whether the overhead of the early-out for a potentially large number of work-items becomes a problem.