Trouble launching CUDA kernels from static initialization code

Question

I have a class that calls a kernel in its constructor, as follows:

"ScalarField.h"

#include <iostream>

    void ERROR_CHECK(cudaError_t err,const char * msg) {
        if(err!=cudaSuccess) {
            std::cout << msg << " : " << cudaGetErrorString(err) << std::endl;
            std::exit(-1);
        }
    }

    class ScalarField {
    public:
        float* array;
        int dimension;

        ScalarField(int dim): dimension(dim) {
            std::cout << "Scalar Field" << std::endl;
            ERROR_CHECK(cudaMalloc(&array, dim*sizeof(float)),"cudaMalloc");
        }
    };

"classA.h"

#include "ScalarField.h"


static __global__ void KernelSetScalarField(ScalarField v) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < v.dimension) v.array[index] = 0.0f;
}

class A {
public:
    ScalarField v;

    A(): v(ScalarField(3)) {
        std::cout << "Class A" << std::endl;
        KernelSetScalarField<<<1, 32>>>(v);
        ERROR_CHECK(cudaGetLastError(),"Kernel");
    }
};

"main.cu"

#include "classA.h"

A a_object;

int main() {
    std::cout << "Main" << std::endl;
    return 0;
}

If i instantiate this class on main (A a_object;) i get no errors. However, if I instantiate it outside main, just after defining it (class A {...} a_object;) I get an "invalid device function" error when the kernel launches. Why does that happen?

EDIT

Updated code to provide a more complete example.

EDIT 2

Following the advice in the comment by Raxvan, I wanted to say i have the dimensions variable used in ScalarField constructor also defined (in another class) outside main, but before everything else. Could that be the explanation? The debugger was showing the right value for dimensions though.

Could you provide more codeto help answer these questions : Is class A in it's own file but the kernel is in another, what are the file extension, etc. You should provide enough code for others to be able to replicate your problem. — deathly809, Jul 21 '14 at 15:49
@Noel Perez Gonzalez if you defined `a_Object` as a global variable it starts execution during the global data initialization. This is a very bad practice since there is not way of knowing the execution order. Having this in mind it is possible that the code that initializes all the CUDA stuff runs later than your global data. — Raxvan, Jul 21 '14 at 15:54
Updated the question with more code (please note i have not compiled it). @Raxvan Thanks for the advice, I just thought runtime order was the same as compile order. — Noel, Jul 21 '14 at 16:34
Indeed your code snippets seem to be incomplete do not compile. Please, post the minimum sized code that someone else could copy, paste, compile and run from his side, indicating the case when your code works and when your code does not work. — Vitality, Jul 21 '14 at 16:49
@JackOLantern I notice that it never even makes it to the main method. I am not really familiar with the CUDA initialization procedure. Does it have to hit the main method first? I tried to initialize CUDA using cudaChooseDevice but it still did not work. — deathly809, Jul 21 '14 at 21:22
_invalid device function_ typically means that the runtime cannot find a binary matching the GPU architecture, see [thrust::device_vector error](http://stackoverflow.com/questions/9711495/thrustdevice-vector-error). I do not have an explanation on why you cannot initialize global objects by launching kernel functions, which may have to do with the C++ mechanism of instantiating global objects. There is surely a workaround that consists into defining a default constructor as in [here](http://pastebin.com/rsSFrnfe). — Vitality, Jul 21 '14 at 21:45

talonmies · Accepted Answer · 2016-08-29T08:19:46.613

The short version:

The underlying reason for the problem when class A is instantiated outside of main is that a particular hook routine which is required to initialise the CUDA runtime library with your kernels is not being run before the constructor of class A is being called. This happens because there are no guarantees about the order in which static objects are instantiated and initialised in the C++ execution model. Your global scope class is being instantiated before the global scope objects which do the CUDA setup are initialised. Your kernel code is never being loaded into the context before it is call, and a runtime error results.

As best as I can tell, this is a genuine limitation of the CUDA runtime API and not something easily fixed in user code. In your trivial example, you could replace the kernel call with a call to cudaMemset or one of the non-symbol based runtime API memset functions and it will work. This problem is completely limited to user kernels or device symbols loaded at runtime via the runtime API. For this reason, an empty default constructor would also solve your problem. From a design point of view, I would be very dubious of any pattern which calls kernels in the constructor. Adding a specific method for class GPU setup/teardown which doesn't rely on the default constructor or destructor would be a much cleaner and less error prone design, IMHO.

In detail:

There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error. This is a part of the "lazy" context initialisation feature of the runtime API. You can confirm this for yourself as follows:

Here is a gdb trace of the revised example you posted. Note I insert a breakpoint into __cudaRegisterFatBinary, and that isn't reached before your static A constructor is called and the kernel launch fails:

talonmies@box:~$ gdb a.out 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary' 
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 10774)]
Class A
Kernel : invalid device function 
[Thread 0x7ffff5a63700 (LWP 10774) exited]
[Inferior 1 (process 10771) exited with code 0377]

Here is the same procedure, this time with A instantiation inside main (which is guaranteed to happen after the objects which perform lazy setup have been initialised):

talonmies@box:~$ cat main.cu
#include "classA.h"


int main() {
    A a_object;
    std::cout << "Main" << std::endl;
    return 0;
}

talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu
talonmies@box:~$ gdb a.out 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary' 
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary ()
(gdb) cont
Continuing.
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 11084)]
Class A
Main
[Thread 0x7ffff5a63700 (LWP 11084) exited]
[Inferior 1 (process 11081) exited normally]

If this is really a crippling problem for you, I would suggest contacting NVIDIA developer support and raising a bug report.

Excellent answer. Would the same occur also for "globally" initialized Thrust objects? — Vitality, Jul 22 '14 at 10:47
@JackOLantern: This problem would be limited to any object which calls a kernel in its constructor. If a thrust object calls a kernel during construction, then it should be effected (and I guess device_vector, which sets a default value during instantiation would be a candidate, although I haven't looked at the source for a long time to be sure). Thanks for the complement. This is my 700th (and probably last) answer on [SO]. — talonmies, Jul 22 '14 at 10:55
Very instructive answer. I have resorted to a member function for initializing the data as you adviced. Thanks. — Noel, Jul 22 '14 at 11:44
I don't think you need to stop answering on StackOverflow. From time to time, interesting questions continue to appear. — Vitality, Jul 22 '14 at 13:02
Yes, I also hope you will continue. Perhaps the rest of us can handle the "dross" and you can take the harder ones. :-) — Robert Crovella, Jul 22 '14 at 15:14
@einpoklum: That was my last answer. Everything else has been community wiki entries. — talonmies, Mar 06 '16 at 16:06
@talonmies: Is there a runtime API function to verify, within host code, that the cuda runtime infrastructure has been erected (and not yet destructed) before calling a cuda API function? — MathManM, Aug 07 '18 at 13:27

Trouble launching CUDA kernels from static initialization code

1 Answers1

Linked

Related