
Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.


I have the following VTune profile for a program compiled with `icc --std=c++14 -qopenmp -axS -O3 -fPIC`:

VTune profile

In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of the instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both executed n times. This happens every time I run the profiler, on both a Westmere Xeon and the Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning at the moment).

What am I missing?

Ignore the poor concurrency; it's most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.

I believe this is not an example of micro-optimisation, since those three instructions added together amount to a decent percentage of the total time, and I'm really interested in the possible cause of this behavior.

Edit: `OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...`

VTune profile

Same profile, albeit with a slightly lower CPI this time.

iksemyonov
  • Data alignment? – Seg Fault Oct 07 '16 at 09:34
  • @SegFault Then it would be in the loading times from the RAM or cache to the registers, which is the `movapsx` instruction. Here it's just the FPU operations, as far as I understand. – iksemyonov Oct 07 '16 at 09:36
  • Multithreading? Your image is hard to read, but maybe the FPU is busy with other stuff in one of the cases... – Sven Nilsson Oct 07 '16 at 10:18
  • @SvenNilsson right-click and open in new tab to see the full resolution. All the threads are busy with the same task, the program is using OpenMP. – iksemyonov Oct 07 '16 at 10:19
  • 1
    Test it with a single thread to see what happens. – Sven Nilsson Oct 07 '16 at 10:22
  • @SvenNilsson See update, same problem. Thanks for the suggestion! – iksemyonov Oct 07 '16 at 10:32
  • @SvenNilsson testing with 2 threads, maybe it's really related to hyperthreading in a sense, but still, look at it, that's the same code, and it happens every single time! Edit: same picture given 2 threads – iksemyonov Oct 07 '16 at 10:38

2 Answers


Well, when analyzing assembly code, please note that running time is attributed to the next instruction, so the per-instruction data you're looking at needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:

Running time is attributed to the next instruction (200108041)

To collect the data about time-consuming running regions of the target, the Intel® VTune™ Amplifier interrupts executing target threads and attributes the time to the context IP address.

Due to the collection mechanism, the captured IP address points to an instruction AFTER the one that is actually consuming most of the time. This leads to the running time being attributed to the next instruction (or, rarely to one of the subsequent instructions) in the Assembly view. In rare cases, this can also lead to wrong attribution of running time in the source - the time may be erroneously attributed to the source line AFTER the actual hot line.

In case the inline mode is ON and the program has small functions inlined at the hotspots, this can cause the running time to be attributed to a wrong function since the next instruction can belong to a different function in tightly inlined code.

Vital
  • Thanks for the link! I'm aware of the offset, but now confused about the direction of it. In my example, are the longer bars really caused by the `mul/add/sub` instructions? – iksemyonov Oct 07 '16 at 11:26
  • The time of instruction A will be shown as the time of instruction A+1, so yes, the longer bars are actually caused by the sub/mul/add instructions. – Vital Oct 07 '16 at 11:29
  • OK, that was what I've been assuming, thanks for the clarification! Now only a little remains: figure out why they take 2x to 3x more time in the second cluster, while being executed the same number of times :) – iksemyonov Oct 07 '16 at 11:30
  • 1
    Try running General Exploration in VTune to observe microarchitecture issues - e.g. more cache misses for the 2nd part. – Vital Oct 10 '16 at 05:02
  • TY, will do now! Is it available in the UI? I ran the cache miss analysis about a year ago using the CLI but totally forgot how to compose the long-ish command. – iksemyonov Oct 10 '16 at 05:04
  • Yes, you'll find it under the "Microarchitecture Analysis" section in the UI. In the CLI, simply use `amplxe-cl -c general-exploration`. – Vital Oct 10 '16 at 19:50

HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.

The inputs for your first group come from your gather. This probably cache-misses a lot, but those costs aren't going to get charged to the SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)

The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.

But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
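
To make the store-forwarding point concrete, here is a minimal sketch of the store/reload pattern being described. The voxel name comes from the question's assembly, but the element type and the surrounding code are assumptions, since the original source isn't shown:

```c++
#include <immintrin.h>
#include <cstdint>

// Hypothetical reconstruction of the "copy to an array, then load it as a
// vector" pattern: four narrow scalar stores followed by one 16-byte load.
// The MOVAPS load overlaps several store-buffer entries, so it can't be
// store-forwarded and has to wait for the stores to commit to L1D
// (a store-forwarding failure, roughly 10 extra cycles of latency).
__m128 load_voxels(const int16_t* p, int dim_x)
{
    alignas(16) float voxel[4];
    voxel[0] = p[0];
    voxel[1] = p[1];
    voxel[2] = p[dim_x];
    voxel[3] = p[dim_x + 1];
    return _mm_load_ps(voxel);   // wide reload of the narrow stores -> stall
}
```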


It looks like there's a lot you could optimize here.

  • instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).

  • calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.

  • Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the tag wiki).

    It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)

    Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).

    Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS); see the first sketch after this list.

  • Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).

  • Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually (see the second sketch after this list). Especially since you go ahead and load the float that you calculate from that into a vector!

  • Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMSKPS to get the result into an integer you can test. You only care about the lower 3 elements, so after a compare for not-equal with _mm_cmpneq_ps, test the result with compare_mask & 0b0111 (true if any of the low 3 bits of the 4-bit mask are set). See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html. A sketch of this check is the last one after this list.
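
Putting the gather bullet together, a hedged sketch of the shuffle-based load follows. The int16_t element type and the base_0 / dim_x names are assumptions carried over from the text above, not the OP's actual declarations, and memcpy replaces the *(const int*) cast to sidestep the aliasing UB:

```c++
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Manual 2x2 gather: two 4-byte loads of adjacent shorts, merged with
// PINSRD, widened with PMOVSXWD, then converted to float with CVTDQ2PS.
// No store/reload, so no store-forwarding failure.
static inline __m128 gather4_ps(const int16_t* base_0, int dim_x)
{
    int32_t row0, row1;
    std::memcpy(&row0, base_0,         sizeof(row0));  // shorts [0], [1]
    std::memcpy(&row1, base_0 + dim_x, sizeof(row1));  // shorts [dim_x], [dim_x + 1]

    __m128i v = _mm_cvtsi32_si128(row0);               // MOVD
    v = _mm_insert_epi32(v, row1, 1);                  // PINSRD (SSE4.1)
    v = _mm_cvtepi16_epi32(v);                         // PMOVSXWD: 4 x int16 -> 4 x int32
    return _mm_cvtepi32_ps(v);                         // CVTDQ2PS: 4 x int32 -> 4 x float
}
```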
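
The second sketch shows the manual floorf() replacement with SSE4.1; whether the scalar or the packed form fits depends on where the value lives in the OP's code, which isn't shown:

```c++
#include <immintrin.h>

// floorf() replacement that stays in XMM registers (ROUNDSS), so there is
// no call and no forced spill of vector registers.
static inline float floor_scalar(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_floor_ss(v, v));   // ROUNDSS, rounding mode = floor
}

// If the value is already (or is about to be) in a vector, floor the whole
// vector instead and skip the scalar round-trip entirely.
static inline __m128 floor_vec(__m128 v)
{
    return _mm_floor_ps(v);                     // ROUNDPS
}
```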
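
The last sketch is the vector change-detection from the final bullet; prev and cur are hypothetical __m128 values holding (x, y, z, don't-care):

```c++
#include <immintrin.h>

// Vector replacement for the scalar prev_x / prev_y / prev_z comparisons.
static inline bool xyz_changed(__m128 prev, __m128 cur)
{
    __m128 neq  = _mm_cmpneq_ps(prev, cur);   // CMPNEQPS: all-ones lanes where they differ
    int    mask = _mm_movemask_ps(neq);       // MOVMSKPS: one bit per lane
    return (mask & 0b0111) != 0;              // only the low 3 lanes (x, y, z) matter
}
```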

Peter Cordes
  • The profs at the uni (we have a few asm folk here) told me to profile for 200s minimum and increase the sampling rate. Which I did and received more level results, but that needs further profiling, anyway. The optimizations you've listed netted me some 25-30% I think, though in a new version of the algorithm, in certain situations, I've managed to cut the weight of this code path down to a minimum. Still, learnt a lot! – iksemyonov Oct 23 '16 at 04:24
  • @iksemyonov: cool, thanks for the real numbers on how much improvement you got. 200s sounds like longer than you usually need to profile things. 1 second usually gives good results for microbenchmarking a single fairly fast function (assuming you account for CPU frequency ramp-up). (i.e. at least ~100k to 10M repeated calls to the actual function under test). If the code you're testing takes a long time to run, and touches a lot of memory, then yeah, longer test intervals will be needed to average over multiple runs and let the cache settle down to the access pattern. – Peter Cordes Oct 23 '16 at 04:30