Is it safe to use cudaHostRegister on only part of an allocation?-CodePudding

I have a C class container that allocates, lets say, 1GB of memory of plain objects (e.g. built-ins).

I need to copy part of the object to the GPU. To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinning"), e.g. with cudaHostRegister(void*, size, ...) before copying.

(This seems to be a good way to copy further subsets of the memory with minimal logic. For example if plain cudaMemcpy is not enough.)

Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.

I may want to register only part because of efficiency, but also because deep down in the call trace I might have lost information of the original allocated pointer.

In other words, can the pointer argument to cudaHostRegister be the something else other than an allocated pointer? in particular an arithmetic result deriving from allocated memory, but still within the allocated range.

It seems to work but I don't understand if, in general, "pinning" part of an allocation can corrupt somehow the allocated block.

UPDATE: My concern is that allocation is actually mentioned in the documentation for the cudaHostRegister flag options:

cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.

cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.

cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().

cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will marked as non cache-coherent and contiguous.

cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using this flag with a current context associated with a device that does not have this attribute set will cause cudaHostRegister to error with cudaErrorNotSupported.

CodePudding user response：

This is a rule-of-thumb answer rather than a proper one:

When the CUDA documentation does not guarantee something is guaranteed to work - you'll need to assume it doesn't. Because if it does happen to work - for you, right now, on the system you have - it might stop working in the future; or on another system; or in another usage scenario.

More specifically - memory pinning happens at page resolution, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for - which it could do, but it's going an extra mile to accommodate you, and I doubt that would happen without documentation.

I also suggest you file a bug report via developer.nvidia.com , asking that they clarify this point in the documentation. My experience is that there's... something like a 50% chance they'll do something about such a bug report.

Finally - you could just try it: Write a program which copies to the GPU with and without the pinning of the part-of-the-region, and see whether there's a throughput difference.

CodePudding user response：

Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.

While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.

Here's why: The alternative interpretation would be that only whole memory sections returned by, say, malloc should be allowed to be registered. However, this is unworkable, because malloc could, behind the scenes, have one big section allocated, and only give the user parts of it. So even if you (the user) were cudaHostRegistering those sections returned by malloc, they'd actually be fragments of some bigger previously allocated chunk of memory anyway.

By the way, Linux has a similar kernel call to lock memory called mlock. It accepts arbitrary memory ranges.

From @einpoklum's answer (original version):

If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything

Registering is worth it, if the chunk of memory being copied is big enough, even if the copying is done only once. I'm seeing about a 2x speed-up with this code (comment out the line indicated)

#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

int main()
{
    std::size_t giga = 1024*1024*1024;
    std::vector<char> src(giga, 3);
    char* dst = 0;
    if(cudaMalloc((void**)&dst, giga)) return 1;

    cudaDeviceSynchronize();
    auto t0 = std::chrono::system_clock::now();

    if(cudaHostRegister(src.data()   src.size()/2, giga/8, cudaHostRegisterDefault)) return 1; // comment out this line
    if(cudaMemcpy(dst, src.data()   src.size()/2, giga/8, cudaMemcpyHostToDevice)) return 1;

    cudaDeviceSynchronize();
    auto t1 = std::chrono::system_clock::now();
    
    auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << (d / 1e6) << " seconds" << std::endl;
    // un-register and free
}