Does __global__ have overhead over __device__?


This question asks about the difference between __device__ and __global__.

The difference is:

__device__ functions can be called only from device code, and they execute only on the device.

__global__ functions can be called from host code, and they execute on the device.

I interpret the difference between __global__ and __device__ to be similar to public and private class access specifiers. The point is to prevent accidentally calling a __device__ function from the host. It sounds like I could tag all void-returning functions as __global__ without changing program behavior. Would this change program performance?

CodePudding user response:

Yes, __global__ has overhead compared to __device__, but there are additional details to be aware of. What you're proposing probably isn't a good idea.

__global__ marks the device-code entry point from host code. Initially, the GPU has no code running on it. When your host code decides that it wants to start some processing on the GPU, the only way to do so is by calling a __global__ function (the so-called kernel launch).
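For concreteness, here is a minimal sketch of that pattern (the kernel name and launch configuration here are just illustrative, not from the question):

#include <cstdio>

// A trivial kernel: the entry point through which host code starts GPU work.
__global__ void hello_kernel()
{
    printf("hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main()
{
    hello_kernel<<<2, 4>>>();   // kernel launch: a grid of 2 blocks, 4 threads each
    cudaDeviceSynchronize();    // wait for the GPU to finish before exiting
    return 0;
}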

You can call a __global__ function from device code, but that is invoking something called CUDA Dynamic Parallelism which has all the attributes of a kernel launch. If you're a beginner, you almost certainly don't want to do that.
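For reference only, a dynamic-parallelism launch looks like this sketch (hypothetical kernel names; compiling it requires relocatable device code, e.g. nvcc -rdc=true linked against cudadevrt, and a GPU of compute capability 3.5 or newer):

__global__ void child_kernel(int *data)
{
    data[threadIdx.x] += 1;
}

__global__ void parent_kernel(int *data)
{
    if (threadIdx.x == 0)
    {
        // A full kernel launch from device code, with its own grid;
        // not an ordinary function call.
        child_kernel<<<1, 32>>>(data);
    }
}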

If you have code running on the GPU, and you want to call a function in the context of a CUDA thread, the way to do that is by calling a __device__ function.

It sounds like I could tag all void-returning functions as __global__ without changing program behavior. Would this change program performance?

It would change both behavior and performance.

A __global__ function (whether called from host code or device code) can only be invoked via a properly configured kernel launch. Using the typical method in the CUDA runtime API, that looks like:

kernel<<<blocks, threads, ...>>>(... arguments ...);

The configuration inside the triple-chevron syntax makes this different from an ordinary function call, and it behaves differently: it launches a new kernel with its own grid (the full complement of threads and blocks associated with a kernel launch).
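As an illustration, a common pattern is to size that grid to cover the problem, one thread per element (vector_add and the device pointers here are hypothetical):

int N = 100000;
int threadsPerBlock = 256;
// Round up so the grid covers all N elements.
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
vector_add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);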

When you call a __device__ function, it looks like an ordinary function call:

func(... arguments ...);

and behaves like one, too. It operates within the context of a single thread, and it does not spin up any new threads, blocks, or kernels to service the function call.
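A sketch to illustrate (the names are hypothetical): each thread that reaches the call simply executes square() within its own context:

__device__ float square(float x)
{
    return x * x;
}

__global__ void square_all(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);   // ordinary call, runs within this thread
}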

You might want to spend a few hours with an orderly introduction to the topic. Just a suggestion; do as you wish, of course.
