18

I was looking for a way to find out where my program spends time. I read the perf tutorial and tried to profile sleep times as it is described there. I wrote the simplest possible program to profile:

#include <unistd.h>
int main() {
  sleep(10);
  return 0; 
}

then I executed it with perf:

$ sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB /home/pablo/perf.data.raw (~578 samples) ]
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
$ sudo perf report --stdio --show-total-period -i ~/perf.data
Error:
The /home/pablo/perf.data file has no samples!

Does anybody know how to avoid these errors? What do they mean? failed to write feature 2 doesn't look too user-friendly...

Update:

$ uname -a
Linux debian 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux
Pavel Davydov
  • 2,799
  • 1
  • 25
  • 39
  • I don't think perf is the right tool. It monitors CPU usage, and your program uses almost no CPU time. – ugoren Feb 16 '14 at 08:52
  • 1
    @ugoren this is a special mode for profiling sleep times, perf only looks for scheduler switch events, not for cpu cycles. – Pavel Davydov Feb 16 '14 at 09:26
  • Currently, the tutorial you're referring to is misleading, you won't get anything useful by following it. I don't think perf currently allows to profile sleeping times, [report is filled](https://bugzilla.kernel.org/show_bug.cgi?id=207323). – Hi-Angel Apr 17 '20 at 14:52
  • @Hi-Angel this question was asked in 2014. Now we have ebpf, so I don't think somebody needs perf sleep times profiling today... – Pavel Davydov Apr 20 '20 at 16:39
  • @PavelDavydov thanks for mentioning, although some links would really help. I think you were referring to bcc scripts, and [here's a usage example](https://github.com/iovisor/bcc/blob/master/tools/offwaketime_example.txt), both alone and with a flamegraph. – Hi-Angel Apr 21 '20 at 20:55

3 Answers3

12

There is a error message from your second perf command from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times - perf inject -s

$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2

failed to write feature 2 doesn't look too user-friendly...

... but it was added to perf to made errors more user-friendly: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian , 22 Sep 2011:

+static int do_write_feat(int fd, struct perf_header *h, int type,  ....
+           pr_debug("failed to write feature %d\n", type);

All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13

 15         HEADER_TRACING_DATA     = 1,
 16         HEADER_BUILD_ID,

So, it sounds like perf inject was not able to write information about build ids (error from function write_build_id() from util/header.c) if I'm not wrong. There are two cases which can lead to error: unsuccessful call to perf_session__read_build_ids() or failing in writing buildid table dsos__write_buildid_table (this is not our case because there is no "failed to write buildid table" error message; check write_build_id)

You may check, do you have all buildids needed for the session. Also it may be useful to clear your buildid cache (rm -rf ~/.debug), and check that you have up-to-date vmlinux with debugging info or kallsyms enabled in your kernel.

UPDATE: in comments Pavel says that his pref record had no any sched:sched_stat_sleep events written to perf.data:

sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out

As he explains in his answer, his default debian kernel have CONFIG_SCHEDSTATS option disabled with vendor's patch. The redhat did the same thing with the option in release kernels since 3.11, and this is explained in Redhat Bug 1013225 (Josh Boyer 2013-10-28, comment 4):

We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.

We can turn this on in debug kernels again, but I'm not sure it's worthwhile.

Josh Poimboeuf 2013-11-04 in comment 8 says that performance impact is detectable:

In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.

Fedora disabled CONFIG_SCHEDSTAT in non-debug kernels at 12 July 2013 "[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones. First kernel with disabled option: 3.11.0-0.rc0.git6.4.

In order to use any perf software tracepoint event with name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait) we must recompile kernel with CONFIG_SCHEDSTATS option enabled and replace default Debian, RedHat or Fedora kernels which have no this option.

Thank you, Pavel Davydov.

Community
  • 1
  • 1
osgx
  • 80,853
  • 42
  • 303
  • 470
  • Thanks, is it possible to make it work somehow? Should I build a newer kernel, rebuild the current one, do some configuration? – Pavel Davydov Feb 19 '14 at 06:13
  • @Pavel Davydov, you can rebuild the perf tool from any kernel without recompiling the kernel itself. Do `cd tools/perf`, read `Makefile.perf` and run make (you will need devel packages needed to build linux-tools package). I can recommend you to clear buildid cache (`~/.debug` folder), use kernel from the distrib with debugging symbols installed. Then, you can rebuild perf and debug it with gdb or printfs. Also try `-N` option of `perf record`. – osgx Feb 19 '14 at 11:01
  • Debugging perf sounds like a challenging task. I'll try either to debug it, or to check vmlinux and kernel symbols when I have time. – Pavel Davydov Feb 19 '14 at 11:12
  • By the way, this error disappears if I execute inject without -v (verbose) flag. Unfortunately, the resulting perf.data is still empty this way. – Pavel Davydov Feb 19 '14 at 11:16
  • Did you check the `perf.data.raw` file with `perf report` or `perf script > output.txt` ? – osgx Feb 19 '14 at 11:40
  • I have checked it now, I executed perf report -i ~/.perf.data.raw, it has two sched_switch samples, one exit sample, but it has zero sched_stat_sleep samples. It looks like this is the answer why the result has no samples. -s flag makes perf to merge switch and sleep events, so merge result should be empty if one of them has no samples. On the other hand, I found -b flag for perf inject, that adds build-ids into the output stream. With this flag `perf report` shows just the same samples as it does for ~/perf.data.raw. – Pavel Davydov Feb 19 '14 at 11:59
  • 1
    This looks strange, why it has no sleep samples, when my program calls `sleep` for 4 seconds? – Pavel Davydov Feb 19 '14 at 12:02
  • Also, perf inject printed a message: symsrc__init: cannot get elf header. Using /usr/lib/debug/lib/modules/3.12-1-amd64/vmlinux for symbols. - Maybe I have to check if my kernel has debug symbols.. – Pavel Davydov Feb 19 '14 at 12:04
  • Thanks, your answer helped me a lot, I haven't found out what's going on completely yet, but now I know at least how to find it out. – Pavel Davydov Feb 21 '14 at 07:36
  • Maybe we should add a link to this answer to perf wiki? It can save a lot of time for people who are going to use this feature, and there are almost no info about it in the web... – Pavel Davydov May 13 '14 at 17:24
  • First link on this page (in your question) is already to the [**only perf tutorial from its authors**](https://perf.wiki.kernel.org/index.php/Tutorial). I think we (you) can add to the perf wiki information about disabled sched_stat_* events and their dependence on SCHEDSTAT kernel config option. Also, you must *send a patch* to the [LKML](http://en.wikipedia.org/wiki/Linux_kernel_mailing_list) to put some definitions of `sched_stat_*` tracepoints (`include/trace/events/sched.h`) under the same ifdef as their usage. Then on kernels without SCHEDSTAT there will be no any perf events for them. – osgx May 13 '14 at 19:24
  • Thanks for the reputation bonus! Yes, I'll try to send a patch, as soon as I'll have time. – Pavel Davydov May 22 '14 at 07:00
  • 3
    In newer kernels, you also need to enable the sched stats, e.g. via: ` echo 1 | sudo tee /proc/sys/kernel/sched_schedstats` – milianw Aug 16 '16 at 09:09
7

I finally found out how to make it work. The problem was that the default debian kernel is built without some config options, that perf needs to be able to monitor sleep times. It looks like CONFIG_SCHEDSTATS should be enabled to make kernel collect scheduler statistics. This is told to have some runtime overhead. Also I enabled CONFIG_SCHED_TRACER and some lock tracing options, but I'm not sure if they matter in my case. Anyway, no statistic data is collected in scheduler without CONFIG_SCHEDSTATS (see kernel/sched/ directory of kernel source).

Also, there is a very good article about perf written by Brendan Gregg, with a lot of usefull examples and some kernel options that are needed to make perf work properly.

Update: I checked the history of CONFIG_SCHEDSTATS in debian. I've checked out debian kernel patches and build scripts repo:

svn checkout svn://svn.debian.org/svn/kernel/dists/trunk/linux/debian

And then found CONFIG_SCHEDSTATS option there

$ grep -R CONFIG_SCHEDSTAT config/
config/config:# CONFIG_SCHEDSTATS is not set

This string was added to the repo in commit 10837, on 2008-03-14, with comment "debian/config: Do complete reorganization". Also, in this and this (thanks to osgx) bug reports it is told that CONFIG_LATENCYTOP, CONFIG_SCHEDSTATS options are not enabled because they can affect kernel perfomance. So, I think it just was never switched on in default debian kernels. I haven't found the discussion about scheduler stats option, though. If I do, I will write back here.

Pavel Davydov
  • 2,799
  • 1
  • 25
  • 39
  • So, now your "`perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit`" works? – osgx May 11 '14 at 21:49
  • @osgx yes, now it finally works! Even most selftests in `perf test` now pass. Thanks for your help. It's a pitty that the only way to find it out is to search in kernel code.. – Pavel Davydov May 12 '14 at 07:54
  • Did you find the `sched_stat_sleep` tracepoint usage in kernel code (not in headers)? I think, this tracepoint can be used in the some code, disabled by turning off `CONFIG_SCHEDSTATS`... – osgx May 12 '14 at 08:09
  • 1
    yes, I have 3.13.10 kernel source with debian patches (from repo), and in `kernel/sched/fair.c` it disables both (`trace_sched_stat_wait` and `trace_sched_stat_sleep`) tracepoints. Even more, it looks like it just disable all scheduler statistics collection, not only trace points. – Pavel Davydov May 12 '14 at 08:14
  • Pavel Davydov, Can you identify the patch? What is its name, and what does it says? Can you fill a debian bug? – osgx May 12 '14 at 08:53
  • 1
    @osgx hm, I think I have to check it first.. I thought debian kernel team disabled it intentionally because of it's known overhead. I thought that because I saw [redhat bug report](https://bugzilla.redhat.com/show_bug.cgi?id=1013225). – Pavel Davydov May 12 '14 at 16:11
  • 1
    I have found debian bugs 481684 and 600935 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=481684 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=600935 - 2.6.26 in debian had already disabled SCHEDSTATS. We can try to search `latencytop`-not working reports, because CONFIG_SCHEDSTAT is one of options needed for it; like https://groups.google.com/forum/#!topic/debian-chinese/EdtPIPovJbQ I think Ubuntu enabled this option, but I'll recheck it. – osgx May 12 '14 at 17:23
  • @osgx I added an update to my answer, I think this option just was never enabled in debian kernel, and they consider it is correct, due to perfomance reasons. – Pavel Davydov May 13 '14 at 16:38
  • In several ubuntus, including 12.04 and 14.04 LTS, the `CONFIG_SCHEDSTAT` option is enabled in the default kernel (versions 2.6.38, 3.0, 3.2, 3.13). "perfomance reasons" for SCHEDSTAT should be noticeable only when there is very high count of running processes, and lot request to scheduler (and in some lmbench test too). As was asked in one of debian bugs, why not to enable some stats needed for tracing, if x86 and x86_64 have "CONFIG_JUMP_LABEL=y" by default, which converts disabled tracepoint into unconditional jump (it is "nop" on most modern CPUs because it is always predicted correctly). – osgx May 13 '14 at 19:17
  • Do you know how to enable it? Using ubuntu 16.04 – Justin Thomas Sep 18 '16 at 15:42
  • @JustinThomas I'm afraid you need to recompile your kernel with CONFIG_SCHEDSTAT enabled to get it, if it doesn't work on default one. However, you can try to do the same with other profiling tools, for example see this systemtap script: https://sourceware.org/systemtap/examples/process/cycle_thief.stp . It does not do exactly the same, however it shows how to instrument the scheduler, and it works on most stock kernels. Also eBPF is a good option, but I think it requires a newer kernel (ubuntu 16 has 4.4 AFAIK). – Pavel Davydov Sep 19 '16 at 09:42
1

This works for me for "perf version 3.11.1" on an "openSUSE 13.1 (x86_64)" box.

Here is the output if you care:

# ========
# captured on: Sun Feb 16 09:49:38 2014
# hostname : *****************
# os release : 3.11.10-7-desktop
# perf version : 3.11.1
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
# cpuid : GenuineIntel,6,58,9
# total memory : 32945368 kB
# cmdline : /usr/bin/perf inject -v -s -i perf.data.raw -o perf.data 
# event : name = sched:sched_stat_sleep, type = 2, config = 0x48, config1 = 0x0, config2 = 0x
# event : name = sched:sched_switch, type = 2, config = 0x51, config1 = 0x0, config2 = 0x0, e
# event : name = sched:sched_process_exit, type = 2, config = 0x4e, config1 = 0x0, config2 = 
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, 
# ========
#
# Samples: 0  of event 'sched:sched_stat_sleep'
# Event count (approx.): 0
#
# Overhead        Period  Command  Shared Object  Symbol
# ........  ............  .......  .............  ......
#


# Samples: 8  of event 'sched:sched_switch'
# Event count (approx.): 80099958776
#
# Overhead        Period  Command      Shared Object             Symbol
# ........  ............  .......  .................  .................
#
   100.00%   80099958776      bla  [kernel.kallsyms]  [k] thread_return
                |
                --- thread_return
                    thread_return
                    do_nanosleep
                    hrtimer_nanosleep
                    SyS_nanosleep
                    system_call_fastpath
                    0x7fbc0dec6570
                    __GI___libc_nanosleep
                    (nil)



# Samples: 0  of event 'sched:sched_process_exit'
# Event count (approx.): 0
#
# Overhead        Period  Command  Shared Object  Symbol
# ........  ............  .......  .............  ......
#


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
}
hivert
  • 10,116
  • 3
  • 27
  • 51
  • Thanks, so, what do the errors I get mean and how can I make this feature work? Have you used some special flags to compile you kernel? – Pavel Davydov Feb 16 '14 at 09:22
  • 1
    I don't know the meaning of the error I didn't recompile my kernel and I issued a `echo 0 > /proc/sys/kernel/kptr_restrict` before launching perf to allows it to access kernel address map. I think you should upgrade your perf. – hivert Feb 16 '14 at 09:25
  • It looks like my version of perf is newer than yours, I have 3.12. It is the newest one in debian testing repository, so to install even newer perf I need to compile a vanilla kernel, but I don't know what flags to use.. – Pavel Davydov Feb 17 '14 at 07:50