This question asks about the difference between __device__ and __global__.
The difference is:
__device__ functions can be called only from the device, and they execute only on the device.
__global__ functions can be called from the host, and they execute on the device.
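For instance, a minimal sketch of the two qualifiers side by side (the names square and square_kernel are just illustrations, not from any particular codebase):

// __device__: callable only from code already running on the GPU
__device__ float square(float x) {
    return x * x;
}

// __global__: callable from the host (via a kernel launch); runs on the device
__global__ void square_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square(in[i]);  // ordinary per-thread call to a __device__ function
    }
}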
I interpret the difference between __global__ and __device__ to be similar to the public and private class access specifiers: the point is to prevent accidentally calling a __device__ function from the host. It sounds like I could tag all void-returning functions as __global__ without changing program behavior. Would this change program performance?
CodePudding user response:
Yes, __global__ has overhead compared to __device__, but there are additional details to be aware of. What you're proposing probably isn't a good idea.
__global__ is the device-code entry point from host code. Initially, the GPU has no code running on it. When your host code decides that it wants to start some processing on the GPU, it can only do so by calling a __global__ function (the so-called kernel launch).
You can call a __global__ function from device code, but that invokes something called CUDA Dynamic Parallelism, which has all the attributes of a kernel launch. If you're a beginner, you almost certainly don't want to do that.
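For completeness, a hedged sketch of what that looks like (hypothetical kernels; dynamic parallelism also requires compiling with relocatable device code, e.g. nvcc -rdc=true, and a supported GPU architecture):

__global__ void child_kernel(int *data) {
    data[threadIdx.x] += 1;
}

__global__ void parent_kernel(int *data) {
    // Launching a __global__ function from device code is CUDA Dynamic
    // Parallelism: a full kernel launch, not an ordinary function call.
    if (threadIdx.x == 0) {
        child_kernel<<<1, 32>>>(data);
    }
}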
If you have code running on the GPU and you want to call a function in the context of a CUDA thread, the way to do that is by calling a __device__ function.
It sounds like I could tag all void-returning functions as __global__ without changing program behavior. Would this change program performance?
It would change both behavior and performance.
When you call a __global__ function (whether from host code or device code), the only way to do that is via a properly configured kernel launch. Using the typical method in the CUDA runtime API, that would be:
kernel<<<blocks, threads, ...>>>(... arguments ...);
The triple-chevron syntax makes this different from an ordinary function call, and it behaves differently: it launches a new kernel with its own grid (the full complement of blocks and threads associated with a kernel launch).
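As a concrete illustration (the kernel name, the device pointers d_in and d_out, and the element count n are assumptions for the example), sizing the grid to cover n elements typically looks like:

int threads = 256;                          // threads per block
int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
kernel<<<blocks, threads>>>(d_in, d_out, n);
cudaDeviceSynchronize();                    // kernel launches are asynchronous with respect to the host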
When you call a __device__ function, it looks like an ordinary function call:
func(... arguments ...);
and it behaves like one, too: it operates within the context of a single thread and does not spin up any new threads, blocks, or kernels to service the function call.
You might want to spend a few hours with an orderly introduction to the topic. Just a suggestion; do as you wish, of course.