6

I have tried to overlap kernel execution with cudaMemcpyAsync transfers, but it doesn't work. I follow all the recommendations in the programming guide: pinned memory, different streams, etc. I see that kernel executions overlap with each other, but they do not overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?
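As a rough way to confirm what the device itself reports about copy engines and concurrent kernels, the attributes can be queried through the runtime API. This is only a sketch: the struct OverlapCaps and the helper names are made up for illustration, while cudaDeviceGetAttribute and the three attribute enums are part of the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the attributes that matter for copy/compute overlap.
struct OverlapCaps {
    int asyncEngines;       // number of asynchronous copy engines (0 = copies cannot overlap kernels)
    int concurrentKernels;  // 1 if the device can run several kernels at the same time
    int tccDriver;          // 1 if the device uses the TCC driver, 0 otherwise (e.g. WDDM)
};

// Fills `caps` for the given device; returns false if any query fails.
bool queryOverlapCaps(int device, OverlapCaps *caps) {
    return cudaDeviceGetAttribute(&caps->asyncEngines,
                                  cudaDevAttrAsyncEngineCount, device) == cudaSuccess
        && cudaDeviceGetAttribute(&caps->concurrentKernels,
                                  cudaDevAttrConcurrentKernels, device) == cudaSuccess
        && cudaDeviceGetAttribute(&caps->tccDriver,
                                  cudaDevAttrTccDriver, device) == cudaSuccess;
}

// Prints the attributes in a readable form.
void printOverlapCaps(int device) {
    OverlapCaps caps;
    if (!queryOverlapCaps(device, &caps)) {
        printf("device %d: attribute query failed\n", device);
        return;
    }
    printf("device %d: async engines = %d, concurrent kernels = %d, TCC driver = %d\n",
           device, caps.asyncEngines, caps.concurrentKernels, caps.tccDriver);
}
```

Calling printOverlapCaps(0) from main prints the values for the first device. An async engine count of 1 means the hardware should be able to overlap one copy direction with kernel execution, so if overlap is still missing the cause is probably elsewhere.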

It seems the "copy engine" and the "execution engine" always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, Kernel, DtoH]. If I issue the HtoD x2, Kernel, DtoH series stream by stream (depth first), the profiler shows that the first HtoD operation of stream 2 does not start until the first DtoH operation ends. If I instead issue the first HtoD on every stream, then the second HtoD, then the kernel and then the DtoH (breadth first), I see no overlap, and the issue order is also enforced by the GPU.
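For reference, the two issue orders described above can be sketched as follows. This is not the original application code, which was not posted. The kernel addKernel, the helpers issueStage and runPipeline, and the buffer layout are all assumptions made for illustration; only the [HtoD x2, Kernel, DtoH] pattern on 4 streams comes from the description.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int kStreams = 4;  // matches the 4 streams described in the question

// Placeholder kernel (hypothetical): element-wise sum of two inputs.
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Enqueues one stage on stream k: 0 and 1 are HtoD copies, 2 is the kernel, 3 is the DtoH copy.
static void issueStage(int stage, int k, int n, cudaStream_t st,
                       float *hA, float *hB, float *hC,
                       float *dA, float *dB, float *dC) {
    size_t off = (size_t)k * n;
    size_t bytes = (size_t)n * sizeof(float);
    switch (stage) {
    case 0: cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, st); break;
    case 1: cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, st); break;
    case 2: addKernel<<<(n + 255) / 256, 256, 0, st>>>(dA + off, dB + off, dC + off, n); break;
    case 3: cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, st); break;
    }
}

// Runs the whole pipeline with n floats per stream.
// Returns the elapsed time in ms, or -1 on failure.
float runPipeline(bool breadthFirst, int n) {
    size_t count = (size_t)kStreams * n;
    size_t total = count * sizeof(float);
    float *hA = nullptr, *hB = nullptr, *hC = nullptr;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    // Pinned host memory, so that cudaMemcpyAsync can run asynchronously.
    cudaMallocHost((void **)&hA, total);
    cudaMallocHost((void **)&hB, total);
    cudaMallocHost((void **)&hC, total);
    cudaMalloc((void **)&dA, total);
    cudaMalloc((void **)&dB, total);
    cudaMalloc((void **)&dC, total);
    if (cudaGetLastError() != cudaSuccess) return -1.0f;  // sketch: no cleanup on failure
    for (size_t i = 0; i < count; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; hC[i] = 0.0f; }

    cudaStream_t st[kStreams];
    for (int k = 0; k < kStreams; ++k) cudaStreamCreate(&st[k]);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    if (breadthFirst) {
        // Breadth first: the same stage on every stream before the next stage.
        for (int stage = 0; stage < 4; ++stage)
            for (int k = 0; k < kStreams; ++k)
                issueStage(stage, k, n, st[k], hA, hB, hC, dA, dB, dC);
    } else {
        // Depth first: all four stages of one stream before the next stream.
        for (int k = 0; k < kStreams; ++k)
            for (int stage = 0; stage < 4; ++stage)
                issueStage(stage, k, n, st[k], hA, hB, hC, dA, dB, dC);
    }
    cudaDeviceSynchronize();  // wait for all streams before taking the stop time
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (size_t i = 0; ok && i < count; ++i) ok = (hC[i] == 3.0f);  // 1 + 2 per element

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int k = 0; k < kStreams; ++k) cudaStreamDestroy(st[k]);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    return ok ? ms : -1.0f;
}
```

Calling runPipeline(false, 1 << 20) and runPipeline(true, 1 << 20) under the profiler should produce timelines comparable to the depth first and breadth first cases. The returned time alone does not prove overlap, so the timeline is still the thing to look at.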

I have tried the simpleStreams example from the CUDA SDK and I see the same behavior.

I attach some screen captures showing the issue in both the Visual Profiler and Nsight for VS2008.

P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.

[Screenshot: Simple Streams, Visual Profiler]

[Screenshot: MyApp, Nsight timeline, breadth first]

[Screenshot: MyApp, Nsight timeline, depth first]

edit:

Adding 4 extra kernels (for a total of 2 HtoD, 5 kernels, 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see a performance improvement (measured from the command line) of 7.5%!

System specification:

  • Windows 7
  • NVIDIA 6800 VGA in first PCI-E slot
  • GTX480 in second PCI-E slot
  • NVIDIA Driver: 306.94
  • Visual studio 2008
  • CUDA v5.0
  • Visual Profiler 5.0
  • Nsight 3.0
einpoklum
Dredok
  • In the depth first example there is no possibility for overlap, as the GTX480 only has a single copy engine. In the breadth first example there is the potential for overlap between the HtoD copies and the kernels, and between the kernels and the DtoH copies. For Nsight VSE you may want to make sure you did not enable serialized trace. Please check the option under Nsight|Options...|Analysis|CUDA Kernel Trace Mode. If you post a reproducible example I can help identify the problem. – Greg Smith Jan 23 '13 at 04:39
  • Edited: In the depth first example I would expect the memcpy from the 2nd stream to start when the memcpy from the first stream ends, and to overlap (partially) with kernel execution (and so on). – Dredok Jan 23 '13 at 08:47
  • By the way, Kernel Trace Mode is set to Concurrent (thanks for pointing that out). – Dredok Jan 23 '13 at 08:50
  • Please provide concrete source code that reproduces the problem. – RoBiK Jan 24 '13 at 11:33
  • @Dredok: I was trying to help you when I realized that I see the same behavior on my system. I know that I have seen overlap before on my system, but I don't know what has changed since the last time I saw it. I have the same spec: Win 7 64-bit, multi-GPU system with a GTX580, CUDA 5.0, driver 310.90. When running simpleStreams from the SDK, it spends more time on the streamed version than on the serialized one, and I see no overlap whatsoever in Nsight :/ This is really bugging me. – brano Jan 24 '13 at 12:13
  • @RoBiK: Concrete source code would be simpleStreams from the SDK. – brano Jan 24 '13 at 12:14
  • Yes, simpleStreams from the SDK also fails to overlap memcpys and kernel executions. On another computer, running just one 8800GTS, I see the overlap... could it be due to the multi-GPU configuration? – Dredok Jan 24 '13 at 13:09
  • @brano: I can confirm the same behavior with my setup (CUDA 5.0, VS 2010, driver version 306.94, GTX 560 Ti). This looks like a driver problem. I also tried an old version of simpleStreams compiled against and running with CUDA 3.0, and I get the same behavior. – RoBiK Jan 24 '13 at 13:10
  • @RoBiK: Are you also running Win 7? It could be a driver problem but it could also be the WDDM. – brano Jan 24 '13 at 13:20
  • @Dredok: Are you running Win 7 on the system with only 1 GPU and what about CUDA toolkit version and driver? – brano Jan 24 '13 at 13:24
  • @brano: On the system with only 1 GPU I run Win 7 x64 with a more recent driver (I can't remember exactly, but I would say 310.x) and CUDA 5.0. – Dredok Jan 24 '13 at 13:25
  • @Dredok: Hmmm, so that leaves the multi-GPU configuration as a possibility. I will have to verify this by pulling out all GPUs except one. If this is true, then Win 7 as a development OS is useless in a multi-GPU configuration. – brano Jan 24 '13 at 13:28
  • @brano: Yes, it is a Win 7 32-bit machine with a single GPU. Later today I can also try another machine with Win 7 x64 and a different GPU. – RoBiK Jan 24 '13 at 14:37
  • I can confirm that at home, using Win 7 x64, a single 8800 GTS GPU and driver 310.70, simpleStreams behaves as expected (overlap between kernel execution and memory transfers). – Dredok Jan 24 '13 at 14:58
  • Another SDK example you can try is simpleMultiCopy. It shows the same problem. – brano Jan 24 '13 at 15:06
  • Overlapping not working on Win 7 x64, GT 520M with developer driver 309.64. – RoBiK Jan 24 '13 at 20:46
  • Tested on the 307.74 driver and it shows the same unexpected behavior :/ – Dredok Jan 29 '13 at 12:03
  • Have you reported this issue to the NVIDIA developers? This may be worth opening a ticket. You just need to be a [registered CUDA developer](https://developer.nvidia.com/joining-cuda-registered-developer-program). You can also try the [latest beta drivers](http://www.nvidia.com/Download/Find.aspx?lang=en) (320.00). – BenC May 07 '13 at 11:17
  • @BenC How can I open this ticket? Should I do it in devtalk nvidia forums? – Dredok May 07 '13 at 14:31
  • @Dredok: once you're registered, log in and post your ticket [here](https://developer.nvidia.com/rdp/bugs/cudagpu-bug-reporting). – BenC May 07 '13 at 14:36
  • Good news: they told me there is indeed a bug, and they are trying to fix it. – Dredok May 08 '13 at 17:03

3 Answers

0

As I said in my comment, there is indeed a bug in the CUDA drivers that prevents streaming from working with my setup. I have tested a compute capability 1.1 card (8800 GTS) and a compute capability 3.5 card (GTX Titan), and both cards work fine. The problem seems to affect some Fermi cards (my GTX 480 does not work).

Dredok
0

I just ran into the same problem. I agree with you that there is a bug. I think the bug is either in the CUDA driver for Windows or in Windows itself. I have tested my code and it works well (with overlapping) on Linux.

In fact, you can test the "simpleStreams" example in the SDK. I found that "simpleStreams" running on Windows shows no overlap between kernel and memory copy at all, but on Linux it works perfectly.
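If you want a quick check without the profiler, the kind of comparison simpleStreams makes (serialized versus streamed) can be approximated with a small timing test. This is only a sketch under my own assumptions: spinKernel, timeChunks and overlapSpeedup are made-up names, the kernel just burns time, and n is assumed to be divisible by the number of streams.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical busy kernel: repeated arithmetic so kernel time is not negligible next to copy time.
__global__ void spinKernel(float *d, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; ++k) v = v * 0.999f + 0.001f;
        d[i] = v;
    }
}

// Times HtoD copy + kernel + DtoH copy over n floats split into nStreams chunks.
// nStreams == 1 is the serialized baseline. Returns ms, or -1 on failure.
float timeChunks(int nStreams, int n, int iters) {
    int chunk = n / nStreams;  // sketch assumes n is divisible by nStreams
    size_t bytes = (size_t)n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return -1.0f;  // pinned memory
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return -1.0f; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> st(nStreams);
    for (int k = 0; k < nStreams; ++k) cudaStreamCreate(&st[k]);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int k = 0; k < nStreams; ++k) {
        float *hp = h + (size_t)k * chunk;
        float *dp = d + (size_t)k * chunk;
        size_t cb = (size_t)chunk * sizeof(float);
        cudaMemcpyAsync(dp, hp, cb, cudaMemcpyHostToDevice, st[k]);
        spinKernel<<<(chunk + 255) / 256, 256, 0, st[k]>>>(dp, chunk, iters);
        cudaMemcpyAsync(hp, dp, cb, cudaMemcpyDeviceToHost, st[k]);
    }
    cudaDeviceSynchronize();  // wait for every stream before taking the stop time
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int k = 0; k < nStreams; ++k) cudaStreamDestroy(st[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? ms : -1.0f;
}

// Ratio of serialized time to streamed time, or -1 on failure.
float overlapSpeedup(int nStreams, int n, int iters) {
    timeChunks(1, n, iters);  // warm-up run, result discarded
    float serial = timeChunks(1, n, iters);
    float streamed = timeChunks(nStreams, n, iters);
    return (serial > 0.0f && streamed > 0.0f) ? serial / streamed : -1.0f;
}
```

For example, overlapSpeedup(4, 1 << 22, 200) returns the ratio for 4 streams. A value clearly above 1 suggests that copies and kernels overlapped, while a value around or below 1 matches what I see on Windows. Timings are noisy, so treat this as a hint and not as proof.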

I am using CUDA 5.0 and a Fermi GTX570. Given your tests on the 8800GT and GTX Titan, I would agree it is a bug in the CUDA driver for Windows. Hopefully it will be fixed soon.

0

TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When it is set to false, the issue appears. If you instead set the "enabled" option to true and the TDR delay value to a very high number, the issue goes away. Please also try the options described below (more common), because they are related to the problem as well!

Read below for the other (older) steps I followed until I came to the solution above, and for some other possible causes.

I was recently able to partially solve this problem! It is specific to Windows and Aero, I think. Please try these steps and post your results to help others! I have tried it on a GTX 650 and a GT 640.

Before you do anything, consider using both the onboard GPU (for display) and the discrete GPU (for computation), because there are verified issues with the NVIDIA driver for Windows! When you use the onboard GPU for display, said drivers don't get fully loaded, so many bugs are avoided. Also, system responsiveness is maintained while working!

  1. Make sure your concurrency problem is not related to other issues such as old drivers (including the BIOS), wrong code, an incapable device, etc.
  2. Go to Computer > Properties.
  3. Select "Advanced system settings" on the left side.
  4. Go to the Advanced tab.
  5. Under Performance, click Settings.
  6. In the Visual Effects tab, select the "Adjust for best performance" option.

This will disable Aero and almost all visual effects. If this configuration works, you can re-enable the visual effect boxes one by one until you find the exact one that causes problems!

Alternatively, you can:

  1. Right-click on the desktop and select Personalize.
  2. Select one of the basic themes, which do not use Aero.

This has the same effect as the steps above, but leaves more visual options enabled. This setting also works for my two devices, so I kept it.

Please, when you try these solutions, come back here and post your findings!

For me, it solved the problem in most cases (a tiled DGEMM I wrote), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency...

UPDATE: The problem was fully solved by a new Windows installation! The previous steps improved the behavior in some cases, but a fresh install solved all the problems!

I will try to find a less radical way of solving this problem; maybe restoring just the registry will be enough.

Community