I've been struggling for some time with a problem I can't seem to find a solution to. When I debug my CUDA code using Nvidia Nsight under Visual Studio 2008, I get strange results when using shared memory.

My code is:

template<typename T>
__device__
T integrate()
{
   extern __shared__ T s_test[]; // Dynamically allocated shared memory
   /**** Breakpoint (1) here ****/
   int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
   T v = 0; // Result placeholder; the code that computes it is omitted
   if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
      s_test[index] = (T)index;
      /* Some other irrelevant code here */
   }
   return v;
}

When I reach breakpoint 1 and inspect the shared memory in the Visual Studio Watch window, only the first 8 values of the array change; the others remain null. I would expect all of the first 64 to change.

[Screenshot: Watch window from Visual Studio]

I thought it might have something to do with not all warps executing simultaneously, so I tried synchronizing them. I added this code inside integrate():

template<typename T>
__device__
T integrate()
{
   /* Old code is still here */

   __syncthreads();
   /**** Breakpoint (2) here ****/
   if(index < 64 && blockIdx.x==0) {
      T tmp = s_test[index]; // Write to tmp variable so I can inspect it inside Nsight Watch window
      v = tmp + index; // Use `tmp` and `index` somehow so that the compiler doesn't optimize it out of existence
   }
   return v;
}

But the problem is still there. Furthermore, the rest of the values inside tmp are not 0, as the Watch window from VS indicates.

[Screenshot: Watch window from Nsight]

I must mention that it takes a lot of steps to step over __syncthreads(), so when I reach it I just jump to breakpoint 2. What the heck is going on!?


EDIT: Information about the system/launch configuration

System

  • Name Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
  • Architecture x86
  • Frequency 2.666 MHz
  • Number of Cores 2
  • Page Size 4.096
  • Total Physical Memory 3.582,00 MB
  • Available Physical Memory 1.983,00 MB
  • Version Name Windows 7 Ultimate
  • Version Number 6.1.7600

Device GeForce 9500 GT

  • Driver Version 301.42
  • Driver Model WDDM
  • CUDA Device Index 0
  • GPU Family G96
  • Compute Capability 1.1
  • Number of SMs 4
  • Frame Buffer Physical Size (MB) 512
  • Frame Buffer Bandwidth (GB/s) 16
  • Frame Buffer Bus Width (bits) 128
  • Frame Buffer Location Dedicated
  • Graphics Clock (Mhz) 812
  • Memory Clock (Mhz) 500
  • Processor Clock (Mhz) 1625
  • RAM Type DDR2

IDE

  • Microsoft Visual Studio Team System 2008
  • NVIDIA Nsight Visual Studio Edition, Version 2.2 Build No. 2.2.0.12255

Compiler commands

1> "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -G -gencode=arch=compute_10,code=\"sm_10,compute_10\" --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -D_NEXUS_DEBUG -g -D_DEBUG -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -I"inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" -maxrregcount=0 --compile -o "Debug/process_f2f.cu.obj" process_f2f.cu

Launch configuration. The shared memory size doesn't seem to matter; I've tried several values. The configuration I've worked with the most is:

  • Shared memory 2048 Bytes
  • Grid/block sizes : {101, 101, 1} , {16, 16, 1}
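
As a rough sketch of how the configuration above maps onto a kernel launch: the kernel `test_kernel` and the wrapper `launch_test` are hypothetical names (the real kernel that calls integrate() is not shown), and `float` is assumed for `T`. The third launch parameter is the dynamic shared memory size in bytes.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that calls integrate(); T is assumed to be float.
__global__ void test_kernel()
{
    extern __shared__ float s_test[];      // Size is set by the third launch parameter
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    if (index < 64 && blockIdx.x == 0) {
        s_test[index] = (float)index;
    }
}

// Launch with the grid, block, and shared memory sizes listed above.
cudaError_t launch_test()
{
    dim3 grid(101, 101, 1);
    dim3 block(16, 16, 1);                 // 256 threads per block
    size_t shared_bytes = 2048;            // Room for 512 floats; the test touches 64
    test_kernel<<<grid, block, shared_bytes>>>();
    cudaError_t err = cudaGetLastError();  // Reports an invalid launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // Waits for the kernel and reports execution errors
}
```

Checking the value returned by `launch_test()` against `cudaSuccess` at least rules out a failed launch before looking at the debugger output.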
Iam
  • Can you provide device info, nsight version, driver version, launch dimensions, and dynamic shared memory size? If you provide a full reproducible including compiler switches then the team can look into the issue. – Greg Smith Oct 02 '12 at 21:17
  • I've added all the information. Thank you very much for your help. – Iam Oct 03 '12 at 10:35
  • @Iam, I am unable to repro this issue. Would you mind getting into contact with me? You may email me at devtools-support at nvidia dot com. Thanks! – Jeff Davis Oct 12 '12 at 03:14
  • I'm voting to close this question as off-topic because the problem couldn't be reproduced and seems to have turned into an off Stack Overflow bug report with a vendor – talonmies Jan 01 '16 at 14:30

1 Answer

Have you tried putting __syncthreads() after assigning the values?

template<typename T>
__device__
T integrate()
{
   extern __shared__ T s_test[]; // Dynamically allocated shared memory
   int index = threadIdx.x + threadIdx.y * blockDim.x; // Local index in block. Column major ordering
   T v = 0; // Result placeholder; the code that computes it is omitted
   if(index < 64 && blockIdx.x==0) { // Only work on a few values. Just testing
      s_test[index] = (T)index;
      /* Some other irrelevant code here */
   }
   __syncthreads();
   /**** Breakpoint (1) here ****/
   return v;
}

And try to see the values at this breakpoint.
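
If the Watch window itself is in doubt, one way to cross-check is to copy the shared memory contents out to global memory and compare them on the host. The following is only a sketch under assumptions: `dump_shared` and `count_mismatches` are hypothetical names, `float` stands in for `T`, and a single 16x16 block is launched. Each thread reads a slot written by a different thread, so the result is only correct if the writes are visible after __syncthreads().

```cuda
#include <cuda_runtime.h>

// Hypothetical test kernel: write to shared memory, synchronize, then read
// a slot that another thread wrote and store it in global memory.
__global__ void dump_shared(float *out)
{
    extern __shared__ float s_test[];
    int index = threadIdx.x + threadIdx.y * blockDim.x;
    if (index < 64) {
        s_test[index] = (float)index;
    }
    __syncthreads();                       // Outside the conditional so every thread reaches it
    if (index < 64) {
        out[index] = s_test[63 - index];   // Value written by thread 63 - index
    }
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int count_mismatches()
{
    const int n = 64;
    float h_out[n];
    float *d_out = 0;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) return -1;

    dump_shared<<<1, dim3(16, 16, 1), n * sizeof(float)>>>(d_out);

    // cudaMemcpy waits for the kernel and also reports errors from the launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != (float)(n - 1 - i)) ++bad;
    }
    return bad;
}
```

A return value of 0 from `count_mismatches()` would suggest the kernel writes all 64 values correctly and that the discrepancy lies in what the debugger displays.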

Younes Nj