C++ CUDA: Why aren't my block dimensions working?

Time:12-13

I'm using an example from a book to solve a 4x4 matrix multiplication. However, the book only provides the kernel code, so the rest of the program is down to me. The book says to use a block width of 2, but I cannot get this to work with dim3 variables. Here is the kernel:

__global__ void matmul_basic(float *c, float *a, float *b, unsigned int width)
{   
    printf("Width: %d\n", width);
    printf("BlockDim.x: %d, BlockDim.y: %d, BlockDim.z: %d\n", blockDim.x, blockDim.y, blockDim.z);
    printf("GridkDim.x: %d, GridDim.y: %d, GridDim.z: %d\n", gridDim.x, gridDim.y, gridDim.z);
    printf("Blockidx.x: %d, Blockidx.y: %d, Blockidx.z: %d\n", blockIdx.x, blockIdx.y, blockIdx.z);
    printf("threadIdx.x %d, threadIdx.y: %d, threadIdx.z: %d\n", threadIdx.x, threadIdx.y, threadIdx.z);
    // Calculate the row index of the c element and a
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    // Calculate the column index of c and b
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    // Sense check
    printf("Row: %d\tCol: %d\n", Row, Col);

    if ((Row < width) && (Col < width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (size_t k = 0; k < width; k++)
        {
            Pvalue += a[Row * width + k] * b[k * width + Col];
        }
        c[Row * width + Col] = Pvalue;
    }
    else {
        printf("Dimensions out of bounds. Row: %d, Col: %d\n", Row, Col);
    }
}

I know the print statements are excessive, but I'm just trying to verify the dimensions. Here are the dimensions in the function call:

dim3 dimGrid = (1, 1, 1);
dim3 dimBlock = (2, 2, 1);
matmul_basic <<<dimGrid, dimBlock>>> (d_c, d_a, d_b, width);

This should be a single block of dimension 2x2 threads? And finally, here is the readout:

Width: 4
BlockDim.x: 1, BlockDim.y: 1, BlockDim.z: 1
GridkDim.x: 1, GridDim.y: 1, GridDim.z: 1
Blockidx.x: 0, Blockidx.y: 0, Blockidx.z: 0
threadIdx.x 0, threadIdx.y: 0, threadIdx.z: 0
Row: 0  Col: 0
Kernel Complete, transferring results...
20218 -1.07374e+08 -1.07374e+08 -1.07374e+08
-1.07374e+08 -1.07374e+08 -1.07374e+08 -1.07374e+08
-1.07374e+08 -1.07374e+08 -1.07374e+08 -1.07374e+08
-1.07374e+08 -1.07374e+08 -1.07374e+08 -1.07374e+08

So it never goes past the very first thread, and it thinks the block size is 1x1x1? It also never goes to the else statement which would indicate it is outside the matrix dimensions.

I'm sure I'm doing something stupid or I'm misunderstanding how the dimensions work. Any help would be greatly appreciated. Thanks!

EDIT: Adding width initialisation and readout from printf statement:

initialisation:

// Determine matrix dimensions
const int width = 4;

Readout in original section above has been edited to include width.

Answer:

it thinks the block size is 1x1x1?

Yes.

Why aren't my block dimensions working?

Because this:

dim3 dimBlock = (2, 2, 1);

is not doing what you think it is doing, and it is not the proper way to initialize a dim3 variable. You might want to spend some time thinking about what the expression (2,2,1) evaluates to in C++. Under the hood, a dim3 variable is a struct with 3 components. You can't set all 3 components of a 3-element struct that way in C++.

Anyhow, you'll have better luck with something like this, which invokes a constructor to set the values:

dim3 dimBlock(2, 2, 1);

or this, which doesn't:

dim3 dimBlock; 
dimBlock.x = 2;
dimBlock.y = 2;
dimBlock.z = 1;
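
To see why the original line produces a 1x1x1 block, here is a minimal host-only sketch. It assumes a hypothetical stand-in struct, Dim3, that mimics dim3 by defaulting every constructor argument to 1, so it compiles with an ordinary C++ compiler and needs no GPU. The function names are illustrative only. Inside the parentheses the commas are comma operators, so (2, 2, 1) evaluates to its last operand, 1, and that single value initialises x while y and z keep their defaults.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 (an assumption for illustration).
// Like dim3, every constructor argument defaults to 1.
struct Dim3 {
    unsigned int x, y, z;
    Dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1)
        : x(vx), y(vy), z(vz) {}
};

// Copy-initialisation from a parenthesised expression. The commas are
// comma operators, so (2, 2, 1) evaluates to 1 and this is the same as
// writing: Dim3 d = 1;
inline Dim3 comma_initialised() {
    Dim3 d = (2, 2, 1);   // x = 1, y = 1 (default), z = 1 (default)
    return d;
}

// Direct initialisation. The three values are constructor arguments.
inline Dim3 constructor_initialised() {
    Dim3 d(2, 2, 1);      // x = 2, y = 2, z = 1
    return d;
}

// Small self-check that can be called from main().
inline void check_initialisation() {
    assert(comma_initialised().x == 1 && comma_initialised().y == 1);
    assert(constructor_initialised().x == 2 && constructor_initialised().y == 2);
}
```

Many compilers warn that the left operand of the comma operator has no effect on the first form, which is a useful hint that the line is not doing what was intended.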

I would also point out that for a 4x4 problem, your grid sizing is not correct either, but you will probably figure that out.
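
On the grid sizing: with one thread per output element, a 4x4 matrix and 2x2 blocks need 2 blocks along each axis, not 1. A common way to compute this is ceiling division, sketched below. The helper name blocks_needed is hypothetical and not part of CUDA.

```cpp
// Number of blocks needed along one axis so that
// blocks * block_width >= width (ceiling division).
constexpr unsigned int blocks_needed(unsigned int width, unsigned int block_width) {
    return (width + block_width - 1) / block_width;
}

// For the question's values: width = 4, block width = 2 -> 2 blocks per axis.
static_assert(blocks_needed(4, 2) == 2, "4x4 matrix with 2x2 blocks needs a 2x2 grid");

// In the CUDA host code this would be used roughly as follows (not compiled here):
//   dim3 dimBlock(2, 2, 1);
//   dim3 dimGrid(blocks_needed(width, dimBlock.x), blocks_needed(width, dimBlock.y), 1);
//   matmul_basic<<<dimGrid, dimBlock>>>(d_c, d_a, d_b, width);
```

Rounding up means a width that is not a multiple of the block width still gets covered, and the kernel's existing (Row < width) && (Col < width) check keeps the extra threads from writing out of bounds.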
