
I have come across a strange performance regression from Linux kernel 3.11 to 3.12 on x86_64 systems. Running Mark Stock's Radiance benchmark on Fedora 20, 3.12 is noticeably slower than 3.11. Nothing else is changed - identical binary, identical glibc - I just boot a different kernel version, and the performance changes. The timed program, rpict, is 100% CPU-bound user-level code.

Before I report this as a bug, I'd like to find the cause of this behavior. I don't know a lot about the Linux kernel, and the changelog from 3.11 to 3.12 does not give me any clue.

I observed this on two systems, an Intel Haswell (i7-4771) and an AMD Richland (A8-6600K). On the Haswell system user time went from 895 sec with 3.11 to 962 with 3.12. On the Richland, from 1764 to 1844. These times are repeatable to within a few seconds.

I did some profiling with perf, and found that IPC went down in the same proportion as the slowdown. On the Haswell system, this seems to be caused by more missed branches, but why should the prediction rate go down? Radiance does use the random number generator - could "better" randomness cause the missed branches? But apart from OMAP4 support, the RNG does not seem to have changed in 3.12.

On the AMD system, perf just points to more idle backend cycles, but the cause is not clear.

Haswell system:

3.11.10  895s user, 3.74% branch-misses, 1.65 insns per cycle
3.12.6   962s user, 4.22% branch-misses, 1.52 insns per cycle

Richland system:

3.11.10  1764s user, 8.23% branch-misses, 0.75 insns per cycle
3.12.6   1844s user, 8.26% branch-misses, 0.72 insns per cycle
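
These figures were collected with perf stat; a sketch of the kind of invocation that produces them (the trailing rpict command line is only a placeholder, not the exact benchmark command):

    # collect the counters quoted above; substitute the real rpict invocation
    perf stat -e task-clock,cycles,instructions,branches,branch-misses -- \
        rpict [options] scene.oct > out.hdr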

I also looked at a diff of the dmesg output of both kernels, but did not see anything that might have caused such a slowdown of a CPU-bound program.

I tried switching the cpufreq governor from the default ondemand to performance, but that did not have any effect.
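
(For reference, the governor can be switched through the sysfs cpufreq interface, roughly like this - exact paths depend on the cpufreq driver:)

    # set the performance governor on every core (as root)
    for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done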

The executable was compiled with gcc 4.7.3 and does not use AVX instructions. libm still seems to use some AVX (e.g. __ieee754_pow_fma4), but these functions account for only 0.3% of total execution time.
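
(Per-function shares such as the 0.3% above can be obtained with perf record / perf report; the rpict command line below is again a placeholder:)

    # profile the run and list time spent per symbol
    perf record -- rpict [options] scene.oct > out.hdr
    perf report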

Additional info: the perf stat output for the A8 system is at http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt; diffs of the dmesg output and of /proc/pid/maps for both kernels were collected as well (see the comments below).

Any ideas (apart from bisecting the kernel changes)?

Chris
  • You could try using a few different benchmarks and seeing how they compare to the Radiance benchmark. Knowing what it is in the workload that is running slower may help. – superdesk Jan 08 '14 at 15:58
  • @superdesk I already tried some smaller programs (bzip2 and stream) and could not detect any slowdown. – Chris Jan 08 '14 at 16:56
  • I also tried a different cpufreq governor to no avail. Edited question to mention this. – Chris Jan 09 '14 at 09:39
  • I added a link to the diffs of the dmesg output of the two kernels on one system. Maybe this will contain a hint. – Chris Jan 10 '14 at 09:32
  • Can you post a diff of the kernel configs and a diff of /proc/$pid/maps (after turning off ASLR with randomize_va_space = 0)? Did you compare all perf events, both software (like page-faults) and hardware? – osgx Jan 10 '14 at 19:28
  • @osgx thanks for the suggestions. I'll try to do all that on monday (the systems are in my office). As mentioned, I'm using the Fedora kernels, so the configs should be available in the source rpms. – Chris Jan 10 '14 at 20:47
  • The benchmark you have reported, is it running in userland, kernel, or both? This is important! It would be interesting to run a benchmark that runs solely in userland. – Tomas Jan 11 '14 at 14:09
  • It is not feasible to compare two systems like that. **You have to compile both kernels with the exact same configuration.** – Tomas Jan 11 '14 at 14:12
  • @Tomas I mentioned this in the question: "100% CPU bound user-level code". Userland only. Also, the point of the question is not to compare performance between kernels, but to find out which difference between the two distribution kernels caused the performance regression. Maybe the regression would go away if they were compiled with exactly the same config. That would not answer my question, though. – Chris Jan 11 '14 at 14:46
  • Maybe the difference would go away with the same config, maybe it would not! Maybe there's some problem in the new version of the kernel. You will not know unless you try. You will have to try it anyway to test the answers, if you get some. – Tomas Jan 11 '14 at 14:49
  • @osgx Added /proc/pid/maps diff. 3.11 maps just one heap region, but 3.12 a lot of them. Is this possibly a clue? – Chris Jan 13 '14 at 10:47
  • Chris, thank you. I found no major diff in the kernel configs. Please check the page-fault stats from `perf stat` and from `/usr/bin/time -v` (and post their diffs too). The many "[heap]" segments are a bit strange; it says the VMA composition was different - possibly a brk difference (check `strace ...|grep brk`). Also, it can be useful to test some 3.12.x kernels where x is less than 6. – osgx Jan 13 '14 at 23:33
  • @osgx thanks. Perf output: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt Will try to add more info during the day (I have to use those systems for other tasks too) – Chris Jan 14 '14 at 06:12
  • Same number of `brk()` calls for both kernels (1096 calls). I guess I'll have to start compiling kernels soon... – Chris Jan 14 '14 at 14:45
  • Added more perf stats from A8 system. There is a significant difference in dTLB misses: Factor 15! – Chris Jan 16 '14 at 14:56
  • I think that can be from hugepage usage on 3.11 and non-usage on 3.12 due to brk randomisation. Check your /proc/PID/smaps for AnonHugePages and /proc/vmstat for thp* values; documented here https://www.kernel.org/doc/Documentation/vm/transhuge.txt – osgx Jan 16 '14 at 22:09
  • @osgx You found the cause! After `echo never > /sys/kernel/mm/transparent_hugepage/enabled`, 3.11.10 takes as long as 3.12.6! If you write this up as an answer, I'll accept it and award the bounty. Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe) would be appreciated. – Chris Jan 17 '14 at 09:47
  • Too bad: `echo 0 > /proc/sys/kernel/randomize_va_space` does not make 3.12.6 any faster. Sigh. – Chris Jan 17 '14 at 10:30

2 Answers


Let's check your perf stat outputs: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt

Kernel 3.11.10

    1805057.522096 task-clock                #    0.999 CPUs utilized          
           183,822 context-switches          #    0.102 K/sec                  
               109 cpu-migrations            #    0.000 K/sec                  
            40,451 page-faults               #    0.022 K/sec                  
 7,523,630,814,458 cycles                    #    4.168 GHz                     [83.31%]
   628,027,409,355 stalled-cycles-frontend   #    8.35% frontend cycles idle    [83.34%]
 2,688,621,128,444 stalled-cycles-backend    #   35.74% backend  cycles idle    [33.35%]
 5,607,337,995,118 instructions              #    0.75  insns per cycle        
                                             #    0.48  stalled cycles per insn [50.01%]
   825,679,208,404 branches                  #  457.425 M/sec                   [66.67%]
    67,984,693,354 branch-misses             #    8.23% of all branches         [83.33%]

    1806.804220050 seconds time elapsed

Kernel 3.12.6

    1875709.455321 task-clock                #    0.999 CPUs utilized          
           192,425 context-switches          #    0.103 K/sec                  
               133 cpu-migrations            #    0.000 K/sec                  
            40,356 page-faults               #    0.022 K/sec                  
 7,822,017,368,073 cycles                    #    4.170 GHz                     [83.31%]
   634,535,174,769 stalled-cycles-frontend   #    8.11% frontend cycles idle    [83.34%]
 2,949,638,742,734 stalled-cycles-backend    #   37.71% backend  cycles idle    [33.35%]
 5,607,926,276,713 instructions              #    0.72  insns per cycle        
                                             #    0.53  stalled cycles per insn [50.01%]
   825,760,510,232 branches                  #  440.239 M/sec                   [66.67%]
    68,205,868,246 branch-misses             #    8.26% of all branches         [83.33%]

    1877.263511002 seconds time elapsed

There are almost 300 Gcycles more for 3.12.6 in the "cycles" field, but only 6.5 Gcycles of those are additional frontend stalls, while 261 Gcycles are additional backend stalls. You have only 0.2 G additional branch misses (each costs about 20 cycles per optimization-manual page 597, so roughly 4 Gcycles), so I think your performance problem is related to the memory subsystem (the backend event is the more realistic candidate, and it can be influenced by the kernel). The page-fault and migration deltas are low, and I think they do not slow down the test directly (but migrations may move the program to a worse place).

You should go deeper into perf counters to find the exact type of problem (this will be easier if you can do shorter runs of the test). Intel's optimization manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf will help you. Check page 587 (B.3.2) for the overall event hierarchy (FE and BE stalls are there too), and B.3.2.1-B.3.2.3 for info on backend stalls and how to start digging (checks for cache events, etc.).
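
For a first look at the backend, the generic TLB and last-level-cache events are a reasonable start (event names vary between CPUs and perf versions; the benchmark command is a placeholder):

    # drill into backend stalls: TLB and LLC behaviour
    perf stat -e cycles,instructions,dTLB-loads,dTLB-load-misses,LLC-loads,LLC-load-misses -- \
        rpict [options] scene.oct > out.hdr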

How can the kernel influence the memory subsystem? It can set up a different virtual-to-physical mapping (hardly your case), or it can move the process farther from its data. You have a non-NUMA machine, but Haswell is not exactly UMA either - there is a ring bus, and some cores are closer to the memory controller or to some parts of the shared LLC (last-level cache). You can test your program with the taskset utility, binding it to one core - then the kernel will not move it to another core.
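
For example (the core number and the rpict command line are placeholders):

    # pin the benchmark to core 0 so the scheduler cannot migrate it
    taskset -c 0 rpict [options] scene.oct > out.hdr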

UPDATE: After checking your new perf stats from the A8 we see that there are more dTLB misses for 3.12.6. Together with the changes in /proc/pid/maps (a lot of short [heap] sections instead of a single [heap] - still no exact info why), I think the difference may be in transparent hugepages (THP): with 2M hugepages, fewer TLB entries are needed for the same amount of memory, so there are fewer TLB misses. In 3.12 THP may not get applied because of the short heap sections.
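
A quick way to see that difference on a running process ($PID stands for the rpict process id):

    # count [heap] mappings; 3.11 shows a single one, 3.12 shows many
    grep -c '\[heap\]' /proc/$PID/maps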

You can check /proc/PID/smaps for AnonHugePages and /proc/vmstat for thp* values to see the THP results. The values are documented at https://www.kernel.org/doc/Documentation/vm/transhuge.txt
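
For example (again, $PID is a placeholder):

    # anonymous memory backed by huge pages for this process
    grep AnonHugePages /proc/$PID/smaps | awk '{sum += $2} END {print sum, "kB"}'
    # system-wide THP counters
    grep thp /proc/vmstat
    # current THP policy
    cat /sys/kernel/mm/transparent_hugepage/enabled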

@osgx You found the cause! After echo never > /sys/kernel/mm/transparent_hugepage/enabled, 3.11.10 takes as long as 3.12.6!

Good news!

Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe) would be appreciated

I was wrong - this multi-heap-section effect is not due to brk randomisation (which changes only the beginning of the heap). It is a failure of VMA merging in do_brk; I don't know yet why, but some changes related to VM_SOFTDIRTY went into mm between 3.11.10 and 3.12.6.

UPDATE2: A possible cause of the VMAs not merging:

http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2580 do_brk in 3.11

http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L2577 do_brk in 3.12

3.12 just added this at the end of do_brk:

2663        vma->vm_flags |= VM_SOFTDIRTY;
2664        return addr;

And a bit above that we have:

2635        /* Can we just expand an old private anonymous mapping? */
2636        vma = vma_merge(mm, prev, addr, addr + len, flags,
2637                                        NULL, NULL, pgoff, NULL);

and inside vma_merge there is a test on vm_flags:

http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L994 3.11

http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L994 3.12

1004        /*
1005         * We later require that vma->vm_flags == vm_flags,
1006         * so this tests vma->vm_flags & VM_SPECIAL, too.
1007         */

vma_merge --> can_vma_merge_before --> is_mergeable_vma ...

 898        if (vma->vm_flags ^ vm_flags)
 899                return 0;

But at the time of this check, the new vma is not yet marked VM_SOFTDIRTY while the old one already is, so the XOR is non-zero, is_mergeable_vma returns 0, and the new heap segment is never merged into the existing one.

osgx
  • The perf logs were from the AMD Richland system - sorry about not pointing that out (though you can see it from the elapsed time). That doesn't invalidate your answer, though - memory seems a likely culprit. Upstream 3.12.6 does not show the problem, btw, so I'll be looking at Fedora patches today. – Chris Jan 16 '14 at 06:03
  • @Chris, there is this commit from the CRIU developers: https://github.com/torvalds/linux/commit/d9104d1ca9662498339c0de975b4666c30485f4e "mm: track vma changes with VM_SOFTDIRTY bit" - authored by cyrillos 4 months ago. – osgx Jan 17 '14 at 10:55
  • @Chris, the quick fix is to disable CRIU's soft-dirty tracking via CONFIG_MEM_SOFT_DIRTY - with that option unset, `VM_SOFTDIRTY` is defined as 0. – osgx Jan 17 '14 at 10:59
  • @Chris, what are results of the quickfix? – osgx Jan 20 '14 at 13:10
  • 3.12 without `CONFIG_MEM_SOFT_DIRTY=y` has low dTLB misses, that is, huge pages are used. Have tested this only with upstream so far, though, not with Fedora 3.12.6. – Chris Jan 20 '14 at 14:42
  • Now also verified with the current Fedora 3.12.8 kernel: commenting out `CONFIG_MEM_SOFT_DIRTY=y` fixes the regression. – Chris Jan 21 '14 at 12:56
  • There is a related thread on LKML: http://lkml.org/lkml/2014/1/22/378 "[Bug 67651] Bisected: Lots of fragmented mmaps cause gimp to fail in 3.12 after exceeding vm_max_map_count" with some fixes. – osgx Jan 23 '14 at 03:01
  • Fixed by Cyrill: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34228d473efe764d4db7c0536375f0c993e6e06a commitid 34228d473efe764d4db7c0536375f0c993e6e06a "mm: ignore VM_SOFTDIRTY on VMA merging". Waiting for backport to 3.12 and 3.13. – osgx Jan 26 '14 at 06:20
  • Finally fixed in Linux 3.13.3 and Linux 3.12.11, released 2014-02-13. – osgx Feb 14 '14 at 02:12

This change could be a likely candidate: http://marc.info/?l=linux-kernel&m=138012715018064. I say this loosely, as I don't have the resources to confirm it. It's worth noting that this was the only significant change to the scheduler between 3.11.10 and 3.12.6.

Anyhow, I'm very interested to see the end result of your findings, so keep us posted.

niharvey
  • Just tested upstream 3.12.6 w/ Fedora config + make oldconfig, and it does not show the problem. So it looks like this patch is not responsible. – Chris Jan 16 '14 at 06:05