
The Linux perf tools are great for finding hotspots in CPU cycles and optimizing them. But once some parts are parallelized it becomes difficult to spot the remaining sequential parts: they take up significant wall time but not necessarily many CPU cycles, because the parallel parts are already burning most of the cycles.

To avoid the XY problem: my underlying motivation is to find sequential bottlenecks in multi-threaded code. The parallel phases can easily dominate the aggregate CPU-cycle statistics even though, per Amdahl's law, the sequential phases dominate wall time.

For Java applications this is fairly easy to achieve with VisualVM or YourKit, which have thread-utilization timelines.

[Screenshot: YourKit thread timeline]

Note that it shows both thread state (runnable, waiting, blocked) and stack samples for selected ranges or points in time.

How do I achieve something comparable with perf or other native profilers on Linux? It doesn't have to be a GUI visualization, just a way to find sequential bottlenecks and the CPU samples associated with them.

the8472
  • The screenshot with the timeline is from a tracing tool, not a profiling one. Check kernelshark+trace-cmd or the LTTng tracers to get [the same](https://static.lwn.net/images/2010/lttv.png). perf is universal and may have some information inside perf.data even in default mode (printed with perf script); but for exact information about thread scheduling it should also trace sched_* events. And perf usually profiles threads only while they are running on a CPU (especially when you are not on AWS or other virtualization and can use the hardware 'cycles' counter), not the wall time. – osgx Aug 19 '17 at 02:17
  • @osgx well, I can do without samples when threads are off-CPU. In principle I just want to figure out what it does during periods where it spends time (~= gets samples) on a single thread. It might be a bit of an XY-problem. My goal is finding single-thread bottlenecks. The most obvious approach to me is visualizing stack samples on a per-thread basis. – the8472 Aug 19 '17 at 03:07
  • perf's default mode is on-CPU sampling profiling. Try the interactive `perf report` text user interface or the options to "focus" on some threads: `perf record -g -F 99 -s ./your_program; perf report -T` or `perf report -T --tid=$TID`, where $TID is the TID of one thread or a comma-separated list. I have not tested the -s/-T options for splitting thread stats, but they are documented: http://man7.org/linux/man-pages/man1/perf-record.1.html http://man7.org/linux/man-pages/man1/perf-report.1.html; per-thread is the default mode: https://perf.wiki.kernel.org/index.php/Tutorial#Collecting_samples – osgx Aug 19 '17 at 05:44
  • @osgx unless I am missing something, those don't help because they only provide aggregate stats. I'm interested not in aggregates but in subsets of the samples where only one thread is active. Basically, I'm not interested in the times where the code is already concurrent, but those dominate the reports even though they may not dominate the wall time. That's what those thread-timeline views in Java provide (they also show thread stacks at specific time slices - not shown in the screenshot - but important to figure everything out). – the8472 Aug 19 '17 at 07:19
  • Intel VTune supports this type of visualization. But it's only free for students and educators. – Hadi Brais Nov 09 '18 at 01:59
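
Pulling the perf invocations suggested in the comments above into one place, a minimal sketch (the 99 Hz rate, `./your_program`, and `$TID` are placeholders; `perf sched` needs permission to read the sched_* tracepoints):

```
# On-CPU samples with call graphs, plus per-thread sample counts (-s)
# so the report can be split per thread.
perf record -g -F 99 -s ./your_program
perf report -T                 # per-thread breakdown
perf report --tid=$TID         # focus the report on one thread

# Scheduler events (context switches, wakeups) to see wall-clock gaps
# where only one thread is runnable.
perf sched record ./your_program
perf sched timehist            # per-event timeline: wait time, sched delay, run time
```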

2 Answers

4

Oracle's Developer Studio Performance Analyzer might do exactly what you're looking for. (Were you running on Solaris, I know it would; but I've never used it on Linux, and I don't have access right now to a Linux system suitable for trying it on.)

This is a screenshot of a multithreaded IO test program, running on an x86 Solaris 11 system:

[Screenshot of a multithreaded IO performance test program]

Note that you can see the call stack of every thread along with exactly how the threads interact - in the posted example, you can see where the threads that actually perform the IO start, and you can follow each of them as it runs.

This is a view that shows exactly where thread 2 is at the highlighted moment:

[Screenshot: thread 2 at the highlighted moment]

This view has the synchronization event display enabled, showing that thread 2 is stuck in a sem_wait() call for the highlighted period. Note the additional rows of graphical data showing the synchronization events (sem_wait(), pthread_cond_wait(), pthread_mutex_lock(), etc.):

[Screenshot: timeline with synchronization events, thread 2 blocked in sem_wait()]

Other views include a call tree:

[Screenshot: call tree view]

a thread overview (not very useful with only a handful of threads, but likely very useful if you have hundreds or more):

[Screenshot: thread overview]

and a view showing function CPU utilization:

[Screenshot: function CPU utilization view]

And you can see how much time is spent on each line of code:

[Screenshot: per-line time view]

Unsurprisingly, a process that's writing a large file to test IO performance spent almost all its time in the write() function.

The full Oracle brief is at https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf

Quick usage overview:
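
A minimal sketch, assuming the `collect`, `analyzer`, and `er_print` commands that ship with Developer Studio (the program name and flag choices below are illustrative, so check them against `collect -h`):

```
# Record an experiment with clock profiling (-p) and synchronization
# wait tracing (-s); writes an experiment directory such as test.1.er.
collect -p on -s on -o test.1.er ./your_program

# Open the experiment in the GUI: timeline, call tree, source views.
analyzer test.1.er

# Or print a text summary of the hottest functions.
er_print -functions test.1.er
```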

Andrew Henle
  • The other views don't seem to be scoped by the timeline, or are they? If they are not then they can't be used to drill down to the time ranges where it was single-threaded. – the8472 Nov 09 '18 at 18:15
  • @the8472 Well, you can *see* what's happening between threads on the timeline view. I've added a view where synchronization events are visible. You can see that the highlighted thread was waiting in `sem_wait()`. Not sure what you mean by "scoped by the timeline". Are you looking for data such as how much CPU time was used based on a timeline selection? – Andrew Henle Nov 09 '18 at 18:23
  • Kind of. What I really want is to look at stack samples of running threads where the application is not fully parallelized. The bottlenecks. Looking for the one active thread when others are idle on the timeline, then selecting that range and getting the aggregate stats for it, would be the way I'd do it with the Java tools I am familiar with. Other ways to do the filtering would be fine too, or better in fact, since selection only gets me one bottleneck at a time instead of overall stats. – the8472 Nov 09 '18 at 18:29
  • I think the example I added - [screenshot](https://i.stack.imgur.com/dsNbD.png) - shows just that. That shows thread 2 blocked in `sem_wait()`, and even shows the exact line of code. In my experience, that's the best way to find non-parallel bottlenecks - you look at the timeline view, and you simply *see* the time ranges where just about every thread blocks. – Andrew Henle Nov 09 '18 at 18:37
  • Yeah, I was talking about the other screenshots; they seem to be unrelated to the question. Does a range-selection over many samples on the timeline show aggregate samples in the call stack view, or just a single stack trace? If it's just one stack trace, it wouldn't be any different from the chain graphs @GalS is suggesting. – the8472 Nov 09 '18 at 18:46
  • @the8472 The data can be filtered by the selection, if that's what you mean. That would limit the data presented on the other views to the data aggregated from just the selected samples. In the GUI, it's done with a right-mouse button popup menu. – Andrew Henle Nov 09 '18 at 20:16
  • This seems sufficient, so I am inclined to accept. Edit: Actually, I'll wait a little more and see if someone else might have a perf-based solution. – the8472 Nov 09 '18 at 22:27
  • @the8472 I'd actually expect you to download it and try the software before accepting the answer. While it's a good tool, it may not do what you need. – Andrew Henle Nov 10 '18 at 04:48
  • [got it to work](https://i.imgur.com/vUJI7fi.png) after a few obstacles. It wants an ancient Java version and hangs on applications that use jemalloc; I had to recompile with the glibc malloc. – the8472 Nov 15 '18 at 02:49
2

You can get the result you want using a great tool we use for Off-CPU Analysis - Off-CPU Flame Graphs, which are part of the Flame Graphs tooling.

I used the Off-CPU analysis:

Off-CPU analysis is a performance methodology where off-CPU time is measured and studied, along with context such as stack traces. It differs from CPU profiling, which only examines threads if they are executing on-CPU.

This tool is based on the tools you mentioned as the preferred ones - perf and the bcc tools - but it provides a really easy-to-use output called a flame graph: an interactive SVG file that looks like this SVG Off-CPU Time Flame Graph.

[Screenshot: Off-CPU time flame graph]

The width is proportional to the total time in the code paths, so look for the widest towers first to understand the biggest sources of latency. The left-to-right ordering has no meaning, and the y-axis is the stack depth.
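
As a concrete sketch of generating such a graph, assuming bcc's offcputime tool and a checkout of Brendan Gregg's FlameGraph repository (the tool path, PID, and 30-second duration are placeholders; on some distros the tool is named offcputime-bpfcc):

```
# 30 seconds of off-CPU stacks for one process, folded format (-f),
# with a delimiter between kernel and user stacks (-d). Times are in microseconds.
sudo /usr/share/bcc/tools/offcputime -df -p $(pgrep -n your_program) 30 > out.offcpu

# Render the folded stacks as an interactive SVG flame graph.
./FlameGraph/flamegraph.pl --color=io --countname=us \
    --title="Off-CPU Time Flame Graph" < out.offcpu > offcpu.svg
```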

Two more analyses that are part of the Off-CPU Flame Graphs work can also help you - personally, I have not tried them.

Wakeup

This lets us solve more problems than off-CPU tracing alone, as the wakeup information can explain the real reason for blocking.

And Chain Graph

Chain graphs are an experimental visualization that associates off-CPU stacks with their wakeup stacks
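
For the wakeup side, bcc also ships an offwaketime tool that pairs each blocked stack with the stack of the thread that eventually woke it; a sketch under the same assumptions as above (the flags should be checked against your bcc version):

```
# 30 seconds of blocked stacks paired with their waker stacks, folded output.
sudo /usr/share/bcc/tools/offwaketime -f -p $(pgrep -n your_program) 30 > out.offwake

# Render as a flame graph; the waker stacks are drawn on top of the blocked stacks.
./FlameGraph/flamegraph.pl --color=chain --countname=us \
    --title="Off-Wake Time Flame Graph" < out.offwake > offwake.svg
```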

There is also an experimental visualization which combines both CPU and off-CPU flame graphs: Hot/Cold Flame Graphs.

This shows all thread time in one graph, and allows direct comparisons between on- and off-CPU code path durations.
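
A rough sketch of one way to assemble such a hot/cold graph, under the same assumptions as above: sample on-CPU stacks with perf, collect off-CPU stacks with offcputime, scale both to the same unit (microseconds), and render the concatenation (the 99 Hz rate and 30-second window are arbitrary):

```
# On-CPU folded stacks: at 99 Hz each sample represents ~10101 microseconds.
perf record -F 99 -g -p $(pgrep -n your_program) -- sleep 30
perf script | ./FlameGraph/stackcollapse-perf.pl \
    | awk '{ $NF = $NF * 10101; print }' > out.oncpu

# Off-CPU folded stacks, already in microseconds (same command as above).
sudo /usr/share/bcc/tools/offcputime -df -p $(pgrep -n your_program) 30 > out.offcpu

# Merge and render; on- and off-CPU towers are now directly comparable by width.
cat out.oncpu out.offcpu | ./FlameGraph/flamegraph.pl --countname=us \
    --title="Hot/Cold Flame Graph" > hotcold.svg
```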

It requires a little time to read about this profiling approach and understand its concepts; however, using it is super easy and its output is easier to analyze than that of the other tools you mentioned above.

Good Luck!

Gal S
  • That is not quite what I am looking for. I do not want the samples of threads being descheduled; that will usually just show IO waits or the stacks of thread pools with empty work queues. I want the on-CPU samples, but only when fewer than N threads are running. Or the samples weighted by the number of idle cores. Basically it has to answer the question "which call stacks representing single-threaded execution do I have to speed up or eliminate to get back to parallel execution?". Even the waker part shows only the immediate stack of the wakeup, not all the preceding computation that had to finish. – the8472 Nov 09 '18 at 17:41
  • @the8472: on a hyperthreaded Intel CPU, there's a perf counter for `cpu_clk_unhalted.one_thread_active`, which ticks when a thread has a core to itself. That can help find code that runs when fewer than max cores are utilized. There are also Linux `perf` events for `cstate_core/c3-residency/` and c6, c7, so you might be able to count when whole cores have idled for a while. (But I don't think that helps you find what code *is* running on other cores.) – Peter Cordes Nov 14 '18 at 19:15
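
A minimal sketch of how the counter mentioned in the comment above could be turned into a bottleneck profile (the event name comes from that comment; whether it exists, and what it is called, depends on the CPU model and perf version, so verify it with `perf list` first):

```
# Sample call graphs on cycles where only one hardware thread of the core
# is active; stacks that run while the sibling is idle gain relative weight.
perf record -e cpu_clk_unhalted.one_thread_active -c 2000000 -g ./your_program
perf report --stdio

# Compare against ordinary cycle sampling to see which stacks stand out.
perf record -e cycles -c 2000000 -g -o perf.cycles.data ./your_program
perf report --stdio -i perf.cycles.data
```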