
This is a fairly complicated question, and I'm not a native English speaker, so thank you for being patient enough to read it.

Since CUDA effectively operates across two separate address spaces, a host pointer is invalid when dereferenced on the device, which means you cannot simply copy structs (or objects) to the device if they have pointer members.

I tried to make the following system to solve this issue:

  1. Use integers instead of pointers. Each integer is an offset into a memory pool, wrapped in a class (which overloads `->` and `*`) so that it behaves like a pointer.
  2. The memory pool object manages a contiguous array of objects, which can easily be transferred to the CUDA device. The pool's contents are synchronized between host and device, so an integer offset has the same meaning on both sides.
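The scheme in the two points above can be sketched in plain host-side C++ (names like `Pool` and `OffsetPtr` are illustrative, not from the question; a real version would `cudaMemcpy` the contiguous storage to the device):

```cpp
#include <cstdint>
#include <vector>

// A pool owning a contiguous array of T; "pointers" are plain offsets.
template <typename T>
struct Pool {
    std::vector<T> storage;          // contiguous, so the whole block can be
                                     // copied to the device in one transfer
    uint64_t allocate(const T& value) {
        storage.push_back(value);
        return storage.size() - 1;   // the offset acts as the "pointer"
    }
    T& at(uint64_t offset) { return storage[offset]; }
};

// A wrapper around the offset. Here the pool is passed in explicitly;
// the question is about making this lookup globally accessible instead.
template <typename T>
struct OffsetPtr {
    uint64_t offset;
    T& deref(Pool<T>& pool) const { return pool.at(offset); }
};
```

Because the offset is position-independent, it stays valid after the storage is copied to a different address space, which raw pointers do not.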

To use the offset conveniently, it should be wrapped. On the host side, the wrapper looks like this:

template<typename T>
class MemPoolPointer {
public:
    inline T* operator -> () const
    {
        return &( MemPool<T>::get_instance().get_object(_p) );
    }
    uint64_t _p;
};

As we can see, the pointer class requires global access to the memory pool. This is usually implemented by making the memory pool a singleton. However, CUDA does not allow static data members in device code, and it limits `__device__` variables to file scope. How can I work around these limitations? Or should I try OpenCL?

talonmies
jiandingzhe
  • Have you tried using a pinned allocation? – Kerrek SB Oct 08 '12 at 13:56
  • *Singletons: Solving problems you never had.* – Kerrek SB Oct 08 '12 at 14:00
  • AFAIK dereferencing pointers in CUDA involves high cost. In my code I generally use flat arrays. There are concepts like SoA and AoS structures: on the CPU we use Array of Structs (AoS), and on the GPU we generally use Struct of Arrays (SoA). – phoad Oct 08 '12 at 22:53
  • @KerrekSB what is "pinned allocation"? – jiandingzhe Oct 09 '12 at 01:12
  • @KerrekSB Yes, I don't necessarily need the singleton. What I need is global access, but `__device__` variables are file scope only. – jiandingzhe Oct 09 '12 at 01:14
  • @jiandingzhe: Err... what's the problem? "File scope" *is* global, non? – Kerrek SB Oct 09 '12 at 01:15
  • @phoad it cannot be simply solved by using SOA, there are many objects shared in different places in my system. If I copy those things for each owner, the memory consumption will be pretty large. – jiandingzhe Oct 09 '12 at 01:16
  • @KerrekSB: Isn't "file scope" something like C's "static functions", which can only be seen within the same source file? – jiandingzhe Oct 09 '12 at 01:20
  • Your class (singleton or not), can reside on the CPU, and own pointers to host and device memory allocated using normal `cudaMalloc`, etc. You can then retrieve device pointers (offset as needed) from the class at kernel invocation time and pass them to the kernel. I see no problem here... – harrism Oct 09 '12 at 01:22
  • @jiandingzhe: I think you're confusing "scope" and "linkage"! – Kerrek SB Oct 09 '12 at 01:38
  • @harrism: I don't want to pass those device pointers to kernel, because there will be so many pointers (about ten I guess). So I want to have them exist on device side as global variables. – jiandingzhe Oct 10 '12 at 03:29
  • @KerrekSB: so what does this mean? "__device__ and __constant__ variables are only allowed at file scope." – jiandingzhe Oct 10 '12 at 09:01
  • You don't have to pass the pointers individually, just pass the object that owns the pointers by value to the kernel (see [this answer](http://stackoverflow.com/questions/9309195/copying-a-struct-containing-pointers-to-cuda-device/9323898#9323898) -- I'm tempted to mark this question as a duplicate of that one). As long as you only dereference the *device* pointers owned by the object in the device code, you are safe. – harrism Oct 10 '12 at 11:40
  • BTW, the concept of using an integer as an offset into a memory pool "instead of" a pointer is amusing -- that's all a pointer is, after all! – harrism Oct 10 '12 at 11:41
  • @jiandingzhe: It means that you can have global variables that are declared as `__device__` or `__constant__`. – Kerrek SB Oct 10 '12 at 13:50
  • @harrism: I can't simply allocate the objects on the device side, because they are often also required on the host side. So I'm trying to find a uniform way to synchronize data between host and device. – jiandingzhe Oct 11 '12 at 08:45
  • Fine, if you insist on a singleton... the singleton need not be static with respect to device code. Make it static on the host, and pass it as an argument to the kernel... – harrism Oct 11 '12 at 10:11
  • @jiandingzhe: Could you please add your solution as an *answer* to this question? You will later be able to accept your own answer (this is allowed), and the question will be marked as solved, making it easier for other people to find by search. Thank you. – talonmies Aug 18 '13 at 07:07
  • @jiandingzhe: it would be helpful if you accept the community wiki answer I have added to this question so that it drops off the CUDA unanswered question list – talonmies Dec 26 '16 at 19:35
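The pass-by-value pattern harrism suggests in the comments can be illustrated with a host-side C++ mock (no actual CUDA here; the struct and field names are hypothetical, and `kernel` is an ordinary function standing in for a `__global__` one):

```cpp
#include <cstdint>

// Bundle the device pointers in one small struct.
struct DeviceViews {
    float*   positions;   // would come from cudaMalloc in real code
    int*     indices;
    uint64_t count;
};

// In CUDA this would be: __global__ void kernel(DeviceViews views)
// Because the struct is passed BY VALUE, its pointer members are copied
// into the kernel's parameter space automatically -- no extra cudaMemcpy
// of the struct itself is needed, and no global __device__ variable.
void kernel(DeviceViews views) {
    for (uint64_t i = 0; i < views.count; ++i)
        views.positions[i] += static_cast<float>(views.indices[i]);
}
```

This sidesteps the file-scope restriction entirely: however many pointers there are, only one kernel argument is needed, as long as the device code dereferences only the device pointers inside the struct.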

1 Answer


The OP was able to solve this by wrapping a global-scope `__device__` variable with a static class method, like this:

class FooBar;
__device__ FooBar* FOOBAR_DEVICE_POOL;
class FooBar
{
    __device__ static FooBar& DEVICE_GET(uint64_t p);
};

template<typename T>
class MemPoolPointer {
public:
    inline T* operator -> () const
    {
#ifdef __CUDA_ARCH__
        return &( T::DEVICE_GET(_p) );
#else
        return &( MemPool<T>::get_instance().get_object(_p) );
#endif
    }
    uint64_t _p;
};

[this answer added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]
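The host half of that dispatch can be exercised with a plain C++ compiler, since `__CUDA_ARCH__` is only defined during device-code compilation and the `#else` branch is taken otherwise. Below is a minimal sketch; the `MemPool` singleton is a stand-in for the OP's real class, not its actual implementation:

```cpp
#include <cstdint>
#include <map>

// Minimal stand-in for the OP's pool: a host-side Meyers singleton.
template <typename T>
class MemPool {
public:
    static MemPool& get_instance() {
        static MemPool instance;
        return instance;
    }
    T& get_object(uint64_t p) { return _objects[p]; }
private:
    std::map<uint64_t, T> _objects;
};

template <typename T>
class MemPoolPointer {
public:
    T* operator->() const {
#ifdef __CUDA_ARCH__
        return &(T::DEVICE_GET(_p));                         // device pass
#else
        return &(MemPool<T>::get_instance().get_object(_p)); // host pass
#endif
    }
    uint64_t _p;
};
```

When nvcc compiles the same file for the device, `__CUDA_ARCH__` is defined and the lookup goes through the `__device__` pool variable via `DEVICE_GET` instead, so one wrapper type works on both sides.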
