CUDA optimise number of blocks for grid stride loop


I have started implementing a simple 1D array calculation using CUDA. Following the documentation, I first tried to define an optimal number of blocks and a block size:

...
int N_array = 1000000;
...
int n_threads = 256;

// round up so every element of the array is covered
int n_blocks = ceil(float(N_array) / n_threads);
dim3 grid(n_blocks, 1, 1);
dim3 block(n_threads, 1, 1);
...

For the kernel, I have used a grid-stride approach as suggested in the NVIDIA blog:

...
int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
int threadsInBlock = blockDim.x;

for (unsigned long long n = global_idx; n < N_array; n += stride) {
    ...

My questions are:

  1. Is it fine to define the number of blocks as above? Or should it be defined such that the total number of requested threads is smaller than the number of available CUDA cores? (My thinking is that defining blocks this way takes advantage of the grid-stride loop, with each thread doing more calculations.)

  2. Since for this large array the number of requested threads is larger than the number of CUDA cores, is there any penalty for having many blocks inactive, compared to requesting fewer blocks and keeping most of them active? (This is related to 1.)

CodePudding user response:

Conventional wisdom is that the number of threads in the grid for a grid-stride loop should be sized to roughly match the thread-carrying capacity of the GPU in question. The reason for this is to maximize the exposed parallelism, which is one of the two most important objectives for any CUDA programmer, because it gives the machine the maximum opportunity to hide latency. Note that this is not the same as the number of CUDA cores. Divorce yourself from thinking about the number of CUDA cores in your GPU for these types of design questions; that number is not relevant to this inquiry.

The thread-carrying capacity of the GPU, ignoring occupancy limiters, is the number of SMs in the GPU times the maximum number of threads per SM.

Both of these quantities can be retrieved programmatically, and the deviceQuery sample code demonstrates how.
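
For example, here is a minimal sketch, assuming device 0 and the CUDA runtime API, of querying those two quantities (the variable names are mine, not from the question):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assuming device 0

    // thread-carrying capacity, ignoring occupancy limiters
    int capacity = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("SMs: %d, max threads/SM: %d, capacity: %d threads\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, capacity);
    return 0;
}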

If you want to be more precise, you can do an occupancy analysis on your kernel to determine the maximum number of threads that can actually be resident on an SM, then multiply this by the number of SMs. Occupancy analysis can be done statically, using the occupancy calculator spreadsheet provided as part of the CUDA toolkit, or dynamically, using the occupancy API. (You can also inspect/measure occupancy after the fact with the Nsight Compute profiler.) There are many questions already here under the cuda tag on SO discussing these things, and it is covered in the programming guide, so I'll not provide an occupancy tutorial here. The number you arrive at via occupancy analysis is upper-bounded by the number of SMs times the max threads per SM.
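
As a sketch of the dynamic approach (my_kernel here is a hypothetical placeholder, and device 0 is assumed), the occupancy API can report the maximum number of resident blocks per SM for a given block size:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *a, int N) { /* hypothetical kernel body */ }

int main() {
    int block_size = 256;
    int max_blocks_per_sm = 0;
    // max blocks of my_kernel, at this block size, that can be resident on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assuming device 0

    // grid size matched to the kernel's actual resident-thread capacity
    int n_blocks = max_blocks_per_sm * prop.multiProcessorCount;
    printf("blocks/SM: %d, grid size: %d blocks\n", max_blocks_per_sm, n_blocks);
    return 0;
}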

You will want to choose the threads-per-block and number-of-blocks values so that those maximums can be achieved. For example, on a cc8.6 GPU with a maximum of 1536 threads per SM, you could choose 512 threads per block and a number of blocks equal to 3 times the number of SMs in your GPU, or 256 threads per block and 6 times the number of SMs. Choosing 1024 threads per block in this particular example (ignoring other occupancy considerations) might not be a good choice: since 1024 does not divide 1536 evenly, only one 1024-thread block can be resident per SM, leaving a third of each SM's thread capacity idle.
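
Putting this together with the asker's grid-stride loop, a capacity-sized launch might look like the sketch below (add_one is a hypothetical kernel; device 0, managed memory, and a block size that divides max threads per SM evenly are all assumptions, and occupancy limiters are ignored):

#include <cuda_runtime.h>

__global__ void add_one(float *a, int N_array) {
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // grid-stride loop: each thread handles multiple elements
    for (int n = global_idx; n < N_array; n += stride)
        a[n] += 1.0f;
}

int main() {
    int N_array = 1000000;
    int n_threads = 512;  // divides 1536 evenly on a cc8.6 GPU

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assuming device 0
    // e.g. 1536 / 512 = 3 blocks per SM on cc8.6
    int n_blocks = (prop.maxThreadsPerMultiProcessor / n_threads) * prop.multiProcessorCount;

    float *a;
    cudaMallocManaged(&a, N_array * sizeof(float));
    add_one<<<n_blocks, n_threads>>>(a, N_array);
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}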
