How to understand the coalesced access in this CUDA matrix copy code?-CodePudding

__global__ void Matrixcopy(float *odata, const float *idata)
{
  // threadblock size = (TILE_DIM, BLOCK_ROWS) = (32, 8)
  // each block copies a 32 * 32 tile
  int x = blockIdx.x * TILE_DIM   threadIdx.x;
  int y = blockIdx.y * TILE_DIM   threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j = BLOCK_ROWS)
    odata[(y j)*width   x] = idata[(y j)*width   x];
}

I'm quite confused about the concept of coalesced access for the multi-dim arrays. The definition of the coalesced global memory access is

Sequential memory access is adjacent

according to the literature Learn CUDA Programming. For 1-dim arrays, It's easy to understand the threads are indexed as threadIdx.x blockDim.x * blockIdx.x, which could easily be mapped to the real 1-dim array: the adjacent threads within a warp access the adjacent physical address of the 1-dim array.

Yet for 2-dim array or a matrix and a 2-dim threadblock like the code above, I'm not sure if I understand it correctly: The adjacent threads within a warp are located in the same row, i.e. same y value, different x values. If y = 0, the contiguous threads are x = [1, 2, 3, 4, 5, 6...], and they access contiguous address [1, 2, 3, 4, 5, 6...] if j = 0. So this code is with coalesced access. Am I correctly understood? And this is just a simple cuda code, if we have a complex cuda kernel, how can we quickly determine whether an access is coalesced or not?

CodePudding user response：

So this code is with coalesced access. Am I correctly understood?

Yes, pretty much. I would have said the threads are x = [0, 1, 2, 3, 4, 5, 6...], and they access contiguous addresses [0, 1, 2, 3, 4, 5, 6...] but basically we are in agreement.

if we have a complex cuda kernel, how can we quickly determine whether an access is coalesced or not?

You can look at any index construction and use the following test: If the threadIdx.x variable is included in the index as an additive factor, and it has no multiplicative factors on it, then the access will coalesce in typical usage (where you have square-ish threadblocks). Any index that can be expressed as:

idx = f   threadIdx.x

where f is arbitrary, but does not include threadIdx.x will result in coalesced access. Adjacent threads "in x" will access adjacent locations in memory. For "non-square-ish" threadblocks, you can develop a similar rule with threadIdx.y. For example, a threadblock of dimensions (1,32) will require that threadIdx.y be included as an additive-only factor.