__global__ void Matrixcopy(float *odata, const float *idata)
{
// threadblock size = (TILE_DIM, BLOCK_ROWS) = (32, 8)
// each block copies a 32 * 32 tile
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = idata[(y + j) * width + x];
}
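For context, TILE_DIM is 32 and BLOCK_ROWS is 8, and the launch looks roughly like this (the matrix size N and the device pointer names are just placeholders for illustration):
dim3 block(TILE_DIM, BLOCK_ROWS);       // (32, 8) threads per block
dim3 grid(N / TILE_DIM, N / TILE_DIM);  // one block per 32 x 32 tile; N assumed to be a multiple of TILE_DIM
Matrixcopy<<<grid, block>>>(d_odata, d_idata);  // d_odata, d_idata: device buffers of N * N floats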
I'm quite confused about the concept of coalesced access for multi-dimensional arrays. The definition of coalesced global memory access is
Sequential memory access is adjacent
according to the book Learn CUDA Programming.
For 1-D arrays it's easy to understand: the threads are indexed as threadIdx.x + blockDim.x * blockIdx.x, which maps directly onto the 1-D array, so adjacent threads within a warp access adjacent physical addresses of the array.
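For example, I picture the 1-D case like this minimal copy kernel (the name copy1D is just made up for illustration):
__global__ void copy1D(float *odata, const float *idata, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;  // global 1-D index
    if (i < n)
        odata[i] = idata[i];  // adjacent threads in a warp touch adjacent addresses
}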
Yet for a 2-D array (a matrix) and a 2-D threadblock like the code above, I'm not sure I understand it correctly. The adjacent threads within a warp are located in the same row, i.e. the same y value with different x values. If y = 0, the contiguous threads are x = [1, 2, 3, 4, 5, 6...], and they access contiguous addresses [1, 2, 3, 4, 5, 6...] if j = 0. So this code uses coalesced access. Is my understanding correct? And this is just a simple CUDA kernel; if we have a complex CUDA kernel, how can we quickly determine whether an access is coalesced or not?
CodePudding user response:
So this code uses coalesced access. Is my understanding correct?
Yes, pretty much. I would have said the threads are x = [0, 1, 2, 3, 4, 5, 6...] and that they access contiguous addresses [0, 1, 2, 3, 4, 5, 6...], but basically we are in agreement.
if we have a complex CUDA kernel, how can we quickly determine whether an access is coalesced or not?
You can look at any index construction and apply the following test: if the threadIdx.x variable appears in the index as an additive term, with no multiplicative factor on it, then the access will coalesce in typical usage (where you have square-ish threadblocks). Any index that can be expressed as

idx = f + threadIdx.x

where f is arbitrary but does not include threadIdx.x, will result in coalesced access: adjacent threads "in x" will access adjacent locations in memory. For "non-square-ish" threadblocks, you can develop a similar rule with threadIdx.y. For example, a threadblock of dimensions (1,32) will require that threadIdx.y appear as an additive-only term.
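As a rough sketch (the kernel and variable names are just made up for illustration), compare these two reads of the same row-major matrix; only the first one passes the additive-threadIdx.x test:
__global__ void accessPatterns(float *out, const float *in, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Coalesced: index = f + threadIdx.x, with f = y*width + blockIdx.x*blockDim.x
    float a = in[y * width + x];

    // Not coalesced: threadIdx.x is multiplied by width, so adjacent threads in a
    // warp read addresses that are width elements apart (strided access).
    float b = in[x * width + y];

    out[y * width + x] = a + b;  // the store is coalesced
}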