
Can someone give a clear explanation of how the new and delete keywords would behave if called from __device__ or __global__ code in CUDA 4.2?

Where does the memory get allocated? If it's on the device, is it local or global?

In terms of the context of the problem: I am trying to create neural networks on the GPU, and I want a linked representation (like a linked list, but each neuron stores a linked list of connections that hold weights and pointers to the other neurons). I know I could allocate using cudaMalloc before the kernel launch, but I want the kernel to control how and when the networks are created.

Thanks!

Twiltie

1 Answer


In-kernel C++ `new` and `delete` operate on the device heap. The device sets aside a portion of global (i.e. on-board) memory for allocations made this way, so the memory is global, not local. `new` and `delete` work in a similar fashion to in-kernel `malloc` and `free`.
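A minimal sketch of what this looks like (illustrative only; the `Node` struct and kernel are hypothetical, assuming compute capability 2.0+):

```cuda
#include <cstdio>

// Hypothetical node type, just to show device-side new/delete.
struct Node {
    float weight;
    Node *next;
};

__global__ void buildAndFree()
{
    // Allocated on the device heap, which is carved out of global
    // memory; it is NOT local memory and is not directly accessible
    // from the host via cudaMemcpy.
    Node *n = new Node;
    if (n != NULL) {  // new returns NULL if the device heap is exhausted
        n->weight = 0.5f;
        n->next = NULL;
        delete n;
    }
}

int main()
{
    buildAndFree<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Memory allocated this way persists across kernel launches until freed with `delete`, and is only accessible from device code.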

You can adjust the amount of device global memory reserved for the heap using a runtime API call, `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)`, made before launching any kernel that allocates from the heap.
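For example, a minimal host-side sketch using `cudaDeviceSetLimit`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Reserve 128 MB for the in-kernel heap. This must be done before
    // launching any kernel that uses malloc/free or new/delete;
    // the default heap size is only 8 MB.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("in-kernel heap size: %lu bytes\n", (unsigned long)heapSize);
    return 0;
}
```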

You may also be interested in the C++ new/delete sample code.

CC 2.0 or greater is required for these capabilities.

chappjc
Robert Crovella
  • Thank you very much! That makes sense. – Twiltie Jan 19 '13 at 18:33
  • 1
    @Twiltie: Using `__device__` `malloc()/free()` or `new/delete` can have negative implications for performance. See http://stackoverflow.com/a/13485322/442006. – Roger Dahl Jan 21 '13 at 00:32
  • @RogerDahl: I see, I will have to keep this in mind. I was planning on using this to create and train neural networks on the GPU to play a specialized form of Conway's Game of Life. I planned on genetic training, so maybe I will set a fixed limit to the number of active neural networks and just allocate them one time. Thank you for the insight! – Twiltie Jan 21 '13 at 18:37
  • 1
    @Twiltie: Also check if you can use local, fixed size C arrays instead (`int myvalues[200];`); If you use `__device__` `malloc()` or if you allocate separate chunks of memory for each thread, you don't get fully coalesced memory accesses. This is because, when the 32 threads in a warp process an instruction that does a load or store, each thread reads from locations that are separated by the size of the allocated blocks, while, ideally, they should be reading adjacent values. The compiler stores fixed size C arrays in an interleaved way, enabling accesses to be fully coalesced. – Roger Dahl Jan 21 '13 at 21:24
  • The new/delete sample code in the SDK is a great example of how placement new can help utilize shared memory. Thanks for pointing it out! – chappjc Mar 05 '15 at 01:06
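The fixed-size local array suggestion in the comments above might look like this (hypothetical kernel, a sketch only):

```cuda
__global__ void accumulate(float *out, int n)
{
    // A fixed-size per-thread array instead of a per-thread heap
    // allocation: the compiler can keep it in registers or place it
    // in local memory, which is interleaved across the threads of a
    // warp so that loads and stores coalesce.
    float myvalues[200];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float sum = 0.0f;
    for (int i = 0; i < 200; ++i) {
        myvalues[i] = tid + i;
        sum += myvalues[i];
    }
    out[tid] = sum;
}
```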