I am new to CUDA programming. What happens if the number of elements is larger than the number of threads?
Consider this simple vector_add example:
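__global__
void add(int n, float *x, float *y)
{
    // Global index of this thread across the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads whose index falls past the end of the arrays
    if (i < n)
        y[i] = x[i] + y[i];
}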
Say the number of array elements is 10,000,000, and we call this function using 64 blocks and 256 threads per block:
int n = 10000000;
int grid_size = 64;
int block_size = 256;
Then only 64*256 = 16384 threads are launched. What would happen to the rest of the array elements?
CodePudding user response:
what would happen to the rest of the array elements?
Nothing at all. They wouldn't be touched and would remain unchanged. Of course, your x array elements don't change anyway, so we are referring to y here. The values of y[0..16383] would reflect the result of the vector add. The values of y[16384..9999999] would be unchanged.
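For this reason (to conveniently handle arbitrary data set sizes independent of the chosen grid size), people sometimes suggest a grid-stride loop kernel design, in which each thread processes several elements, stepping through the array by the total number of threads in the grid.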
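A minimal sketch of what such a grid-stride loop could look like for this vector add is given here. The kernel name add_grid_stride, the helper count_not_updated, the initial values 1.0f and 2.0f, and the use of unified memory (cudaMallocManaged) are all assumptions made for illustration and are not part of the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then
// advances by the total number of threads in the grid until it passes n.
__global__ void add_grid_stride(int n, const float *x, float *y)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical helper: runs the kernel on n elements and returns how many
// elements of y were NOT updated, or -1 if a CUDA call failed.
int count_not_updated(int n, int grid_size, int block_size)
{
    float *x = nullptr, *y = nullptr;
    // Unified memory is an assumption that keeps the sketch short.
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1;
    }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    add_grid_stride<<<grid_size, block_size>>>(n, x, y);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaDeviceSynchronize() == cudaSuccess;

    // An updated element holds 1.0f + 2.0f = 3.0f, which is exact in float.
    int not_updated = 0;
    if (ok)
        for (int i = 0; i < n; ++i)
            if (y[i] != 3.0f) ++not_updated;

    cudaFree(x);
    cudaFree(y);
    return ok ? not_updated : -1;
}

int main()
{
    // Same launch configuration as in the question: 64 blocks of 256 threads.
    int not_updated = count_not_updated(10000000, 64, 256);
    printf("elements not updated: %d\n", not_updated);
    return not_updated == 0 ? 0 : 1;
}
```

With the same 64 x 256 launch, each of the 16384 threads handles roughly 610 elements instead of one, so the count of untouched elements is expected to be 0 on a working device. The original one-element-per-thread kernel would instead need a grid large enough to cover all n elements.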