How to copy 2d array in cuda?-CodePudding

I am new to cuda and still trying to figure things out, so this question maybe dumb but I can't seem to figure out the problem so bare with me.

I am trying to copy a 2d array to the GPU. The size of the array is N*N (square array). I'm trying to copy it using MallocPitch() & cudaMemcpy2D(). The problem is I seem to only be copying the first row of the array and nothing else. I can't find what exactly im doing wrong.

My code:

void function(){
   double A[N][N];
       //code to fill out the array.
 
   double* d_A;
   size_t pitch;
   cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
   cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);

   int threadnum = 1;
   int blocksnum = 1; 
   
   kernal_print<<<blocknum, threadnum>>>(d_A, N); 
   
   //copying back to host & freeing up memory

}

__global__ void kernal_print(double* d_A, N){
   int xIdx = threadIdx.x   blockDim.x * blockIdx.x; 
   int yIdx = threadIdx.y   blockDim.y * blockIdx.y;

   printf("\n");
   for(int i = 0; i < N*N; i  ){
       printf("%f, ",d_A[i]);
   }
   printf("\n");
}

The code above will only print the first row of whatever matrix I have. So for example a 3x3 matrix that looks like this:

1 2 3
4 5 6
7 8 9

the code will print (1 2 3 0 0 0 0 0 0)

Any idea of what Iam doing wrong? Thanks in advance!

CodePudding user response：

This question may be useful for background.

Perhaps you don't know what a pitched allocation is. A pitched allocation looks like this:

X  X  X  P  P  P
X  X  X  P  P  P
X  X  X  P  P  P

The above could represent storage for a 3x3 array (elements represented by X) that is pitched (pitched value of 6 elements, pitch "elements" represented by P).

You'll have no luck accessing such a storage arrangement if you don't follow the guidelines given in the reference manual for cudaMallocPitch. In-kernel access to such a pitched allocation should be done as follows:

T* pElement = (T*)((char*)BaseAddress   Row * pitch)   Column;

You'll note that the above formula depends on the pitch value that was provided at the point of cudaMallocPitch. If you don't pass that value to your kernel, you won't have any luck with this.

Because you are not doing that, the proximal reason for your observation:

the code will print (1 2 3 0 0 0 0 0 0)

is because your indexing is reading just the first "row" of that pitched allocation, and the P elements are showing up as zero (although that's not guaranteed.)

We can fix your code simply by implementing the suggestions given in the reference manual:

$ cat t2153.cu
#include <cstdio>
const size_t N = 3;
__global__ void kernal_print(double* d_A, size_t my_N, size_t pitch){
//   int xIdx = threadIdx.x   blockDim.x * blockIdx.x;
//   int yIdx = threadIdx.y   blockDim.y * blockIdx.y;

   printf("\n");
   for(int row = 0; row < my_N; row  )
     for (int col = 0; col < my_N; col  ){
       double* pElement = (double *)((char*)d_A   row * pitch)   col;
       printf("%f, ",*pElement);
     }
   printf("\n");
}

void function(){
   double A[N][N];
   for (size_t row = 0; row < N; row  )
     for (size_t col = 0; col < N; col  )
       A[row][col] = row*N col 1;
   double* d_A;
   size_t pitch;
   cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
   cudaMemcpy2D(d_A, pitch, A, N * sizeof(double) , N * sizeof(double), N, cudaMemcpyHostToDevice);

   int threadnum = 1;
   int blocknum = 1;

   kernal_print<<<blocknum, threadnum>>>(d_A, N, pitch);
   cudaDeviceSynchronize();
}

int main(){

  function();
}
$ nvcc -o t2153 t2153.cu
$ compute-sanitizer ./t2153
========= COMPUTE-SANITIZER

1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000,
========= ERROR SUMMARY: 0 errors
$

A few comments:

The usage of the term 2D can have varied interpretations.
Using a pitched allocation is not necessary for 2D work, and it may also have no practical value (not making your code simpler or more performant).
For further discussion of the varied ways of doing "2D work", please read the answer I linked.
This sort of allocation: double A[N][N]; may give you trouble for large N, because it is a stack-based allocation. Instead, use a dynamic allocation (which may affect a number of the methods you use to handle it.) There are various questions covering this, such as this one.