Strange behaviors of a CUDA kernel with an infinite loop on different NVIDIA GPUs


#include <cstdio>
__global__ void loop(void) {
    int smid = -1;
    if (threadIdx.x == 0) {
        asm volatile("mov.u32 %0, %%smid;": "=r"(smid));
        printf("smid: %d\n", smid);
    }
    while (1);
}

int main() {
    loop<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

This is my source code. The kernel just prints the smid when the thread index is 0 and then enters an infinite loop, and the host just launches the kernel and waits for it. I ran some experiments under two different configurations, as follows:

  • 1. GPU: GeForce 940M, OS: Ubuntu 18.04, MPS: enabled, CUDA: v11.0
  • 2. GPU: GeForce RTX 3050 Ti Mobile, OS: Ubuntu 20.04, MPS: enabled, CUDA: v11.4

Experiment 1: When I run this code under configuration 1, the GUI seems to freeze (no graphical response can be observed anymore), but as soon as I press Ctrl+C and the CUDA process is killed, the freeze disappears.

Experiment 2: When I run this code under configuration 2, the system works fine without any abnormal behavior, and the smid output, such as smid: 2\n, is displayed.

Experiment 3: When I change the launch configuration to loop<<<1, 1024>>> and run this new code twice under configuration 2, I get the same smid output both times, such as smid: 2\nsmid: 2\n. (For the GeForce RTX 3050 Ti Mobile, the number of SMs is 20, the maximum number of threads per multiprocessor is 1536, and the maximum number of threads per block is 1024.)
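For reference, those limits are the values reported by cudaGetDeviceProperties; a minimal query sketch (separate from the experiment code itself):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Report the limits relevant to the block/grid configurations above
    // (the numbers quoted here are for the GeForce RTX 3050 Ti Mobile).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}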

I'm confused with these results, and here are my questions:

  • 1. Why doesn't the system output smid under configuration 1?
  • 2. Why does the GUI seem to freeze under configuration 1?
  • 3. Unlike experiment 1, why does experiment 2 output smid normally?
  • 4. In the third experiment, the block configuration uses 1024 threads, which means that two different blocks cannot be scheduled onto the same SM. Under MPS, all CUDA contexts are merged into one CUDA context and share the GPU resources without time-slicing anymore, so why do I still get the same smid in the third experiment? (Furthermore, when I change the grid configuration to 10 and run it twice, the smid varies from 0 to 19 and each smid appears exactly once!)

CodePudding user response:

  1. Why doesn't the system output smid under configuration 1?

A safe rule of thumb is that, unlike host code, in-kernel printf output is not printed to the console at the moment the statement is encountered, but at the completion of the kernel and device synchronization with the host. That is the regime in effect in configuration 1, which uses a Maxwell GPU. So no printf output is observed in configuration 1, because the kernel never ends.
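As an illustration (a minimal sketch of my own, not code from the question), a kernel that does terminate shows its output only once the kernel completes and the host synchronizes:

#include <cstdio>
__global__ void hello(void) {
    // In-kernel printf writes into a device-side buffer.
    if (threadIdx.x == 0)
        printf("hello from block %d\n", blockIdx.x);
}

int main() {
    hello<<<2, 32>>>();
    // Nothing is visible on the console yet; the buffer is flushed at a
    // synchronization point such as the cudaDeviceSynchronize() below.
    cudaDeviceSynchronize();
    return 0;
}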

  2. Why does the GUI seem to freeze under configuration 1?

For the purpose of this discussion, there are two possible regimes: a pre-Pascal regime in which compute preemption is not possible, and a Pascal-and-later regime in which it is. Your configuration 1 is a Maxwell device, which is pre-Pascal. Your configuration 2 is an Ampere device, which is post-Pascal. So in configuration 2, compute preemption is working. This has a variety of impacts, one of which is that the GPU will service both GUI needs and compute kernel needs "simultaneously" (the low-level behavior is not thoroughly documented but is a form of time-slicing, alternating attention between the compute kernel and the GUI). Therefore in configuration 1, pre-Pascal, a kernel that runs for any noticeable length of time will "freeze" the GUI for the duration of kernel execution. In configuration 2, the GPU services both, to some degree.
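You can check which regime a given device falls into by querying the compute-preemption attribute; here is a minimal sketch (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Pascal and newer GPUs report 1 here; Maxwell and older report 0.
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrComputePreemptionSupported, 0);
    printf("compute preemption supported: %d\n", supported);
    return 0;
}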

  3. Unlike experiment 1, why does experiment 2 output smid normally?

Although it's not well documented, the compute preemption process appears to introduce an additional synchronization point, allowing the printf buffer to be flushed, as mentioned in point 1. If you read the CUDA documentation on the in-kernel printf output buffer, you will see that "synchronization point" covers a number of possibilities, and compute preemption seems to introduce a (new) one.

Sorry, I won't be able to answer your 4th question at this time. A best practice on SO is to ask one question per question. However, I would consider using MPS on a GPU that is also servicing a display to be "unusual". Since we've established that compute preemption is in effect here, it may be that, due to compute preemption as well as the need to service a display, the GPU services its clients in a round-robin time-slicing fashion (since it must do so anyway to service the display). In that case the behavior under MPS may be different. Compute preemption allows for the possibility of the usual limitations you are describing being voided: one kernel can completely replace another.
