Style: You could easily give your loop labels meaningful names, like `copy_search:` and `outer:`.
You have some major missed optimizations, so yes, there are some things that will help.
(Since this is homework, I'm just going to describe them, not actually implement them. As tempting as it would be to write my own implementation, I'd then have to debug it. The real uses would be pretty limited, though. State-of-the-art sorting on x86 for short arrays (e.g. as a base case for MergeSort or QuickSort) is a SIMD sorting network using SSE2 or AVX2 shuffles and packed min/max instructions. I think branchy sorting is usually not the best choice in x86 asm even for tiny arrays like 3 or 4 elements long: then you'd probably just want some scalar branchless code.)
The easiest way to get better asm would be a rewrite in C, compiled with optimization enabled. That's a good way in general to get ideas for how to optimize: even if you're an expert at asm optimization, the compiler might think of something you didn't. Using compiler output as the starting point for tuning is usually good.
The `halfpoint` thing is a puzzle for you to solve: how to do that cheaply without recomputing the array length inside the outer loop. Hint: Insertion Sort never looks at anything beyond the already-sorted region at the bottom plus the next element to be inserted, regardless of what final stop-condition you pass as a length. Hoisting that condition out of the loop entirely is probably the only algorithmic optimization; the rest is just implementing Insertion Sort more efficiently in asm.
(In the general case for loop-invariant conditions inside loops, you can make 2 versions of the loop and branch to select which one to run. But here we can just tweak the sort loop inputs before it starts.)
One of the most obvious optimizations is Why are loops always compiled into "do...while" style (tail jump)? - use `sub edi, 4` / `jnz copy_search_loop` at the bottom of your inner loop.
Another is to load the element to be compared and copied into a register, so you don't have to load it twice.
This will help a bit, but not a huge amount on modern x86. Repeated loads that hit in L1d cache are cheap-ish. Intel and AMD can both do 2 memory operations per clock, up to 1 of which is a store. You're using indexed addressing modes for stores, so Intel Haswell and later can't use the dedicated simple-store AGU on port 7, otherwise it could do 2 loads + 1 store per clock.
After fixing your inner loop to avoid a `jmp` (just one fall-through `jcc` and a `sub`/`jcc` at the bottom), your loop should only be 4 uops on Haswell and later (mov-load, macro-fused cmp/jcc, mov-store, macro-fused sub/jcc). Depending on how it decodes, one of the branches might not macro-fuse on Sandybridge. (SnB/IvB can't make 2 fusions in one decode group of up-to-4 uops that hit the decoders in the same cycle.) So with a register `cmp`/`jcc` instead of memory, the inner loop can maybe run at 1 iteration per clock (when it doesn't mispredict).
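A minimal sketch of what that could look like, under my assumptions (`eax` = key, matching the vector version below; `edx` = scratch; `esi` = array base; `edi` = byte index into the sorted region; the stop condition at the very front of the array is glossed over here, see the offsetting discussion further down):

```
copy_search:
    mov   edx, [esi+edi]      ; load arr[j] into a register once (mov-load)
    cmp   edx, eax            ; arr[j] <= key: found the insertion point
    jle   .insert_here        ; fall through on the common keep-searching path
    mov   [esi+edi+4], edx    ; shift arr[j] up one slot (mov-store)
    sub   edi, 4
    jnz   copy_search         ; macro-fused sub/jcc at the bottom: do...while style
.insert_here:
    mov   [esi+edi+4], eax    ; drop the key into the hole
```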
If you have a `cmp reg, 0` left over after optimizing to use flags set by `sub` or `dec`, optimize it into `test reg, reg`, which is 1 byte shorter but otherwise pretty much equal performance. (And it sets all flags identically except AF, which you can't branch on.)
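i.e. a straightforward peephole substitution:

```
    cmp   eax, 0              ; 3 bytes (83 F8 00)
    jnz   copy_search
;; becomes
    test  eax, eax            ; 2 bytes (85 C0)
    jnz   copy_search
```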
> It will most likely only ever have 20 elements in the array. Maybe up to 30. And it will run a few thousand times I believe w/ same array I think.
Ok, that means branch prediction might "learn" the pattern of branching for small sorts, since the function re-runs repeatedly on the same data.
That means tuning the asm might actually matter. With randomized data, most gains would be dwarfed by the number of cycles the CPU spends recovering from branch mispredictions. Of course, more efficient code lets it detect mispredictions sooner, and chew through the work until the next mispredict slightly faster. Modern CPUs are already very good at chewing through redundant instructions (as long as they don't increase critical-path latency much), because a lot of the machine code running on modern CPUs was created by JIT compilers in JVMs and similar, and isn't very well optimized.
For sorts of size 20 or 30, the copy-search loop will run multiple iterations quite a few times (unless the input is already close to sorted), so presumably its branches will usually predict correctly as continuing to loop. And optimizing it to do fewer loads, and run fewer instructions, for the still-searching case should actually help.
Remember, optimize your code for the common case. For loops, that usually means making the "keep looping" case fast. This means it's sometimes worth spilling something (storing it to memory) outside a loop to free up more registers for use inside the loop.
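A hypothetical example (the stack slot and the choice of `ecx` are my assumptions, not from the original code):

```
    mov   [esp+8], ecx        ; spill the loop-invariant element count to a free stack slot
outer:
    ; ... inner loop is free to use ecx as scratch here ...
    mov   ecx, [esp+8]        ; reload only where the outer loop actually needs it
    ; ... outer-loop bookkeeping ...
```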
Another way to save registers is to use pointer increments instead of base+index addressing. Or leave some read-only data in memory, especially if you only read it once per outer-loop iteration. e.g. the outer loop condition could be a `cmp [end_pointer], esi` / `jb`.
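For example, a hedged sketch of that structure, with `esi` as the advancing element pointer and `end_pointer` a hypothetical memory location holding one-past-the-end of the array (written here with the branch in the loop-continue direction):

```
outer:
    ; ... load the next key and run the copy-search inner loop ...
    add   esi, 4              ; pointer increment instead of base+index
    cmp   [end_pointer], esi  ; read-only end pointer stays in memory: one load per outer iteration
    ja    outer               ; keep looping while end_pointer > esi
```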
**For non-tiny arrays: vectorize the copy-search loop with SSE2 or AVX2**
(And maybe for small arrays, if you can make the overhead low enough, the lower limit being when most elements are inserted within 3 elements of the end of the sorted region. Or if the overhead is higher, a larger average copy size is needed to make it worth it.)
For larger sorts, Insertion Sort will spend a lot of time copying the array over 1 element at a time. i.e. the inner loop will often run many iterations. We can gain a lot by vectorizing it with SSE2 to copy-and-search 4 elements in parallel.
```
;; something like this; I haven't checked the logic carefully
    movd     xmm0, eax            ; key
    pshufd   xmm0, xmm0, 0        ; broadcast the key to all 4 elements: [key, key, key, key]

    ;; TODO: handle edi not a multiple of 4 somewhere

copy_search:
    movdqu   xmm1, [esi+edi]      ; load 16 bytes
    movdqa   xmm2, xmm0
    pcmpgtd  xmm2, xmm1           ; packed xmm2 = key > arr[i + 0..3]
    pmovmskb eax, xmm2
    test     eax, eax
    jnz      .found_element_less_than_key
             ;; figure out which element from the bitmap, and do something.
             ;; e.g. movd [mem], xmm0 to store the new element, because we destroyed EAX.
    movdqu   [esi+edi+4], xmm1    ; store after checking, because we might not want to move all 4
    sub      edi, 16              ; edi is the byte index; esi stays fixed as the (biased) base
    jg       copy_search
;; else fall through: key goes in one of the first 1 to 3 elements
;; handle the non-multiple-of-4 case here because it's rarely reached,
;; and doing it here instead of at the start avoids store-forwarding stalls for short counts
```
If the input array is 16-byte aligned, handling the non-multiple-of-4 case before entering the copy-search loop is tempting: then all the loads can be aligned. (The stores would still be misaligned, though, so maybe even handle it in a way that makes the stores aligned instead?) But modern CPUs have efficient unaligned-load handling, and cache-line splits aren't super expensive. Page splits (across a 4k boundary) are very expensive on Intel before Skylake (like ~100 extra cycles, vs. the same cost as a cache-line split on Skylake).
But unless you can do that very cheaply, avoiding it entirely most of the time (except when the copy-search loop reaches the very front of the array) is probably a win.
The loop termination condition can't be as simple as `i > 0`, because we need to avoid a decrement by 4 elements going off the start of the array. But by offsetting `esi` (the array base) we can still handle that with just a `sub`/`jcc` as the loop condition, not needing a `sub` and then a `cmp`/`jcc` against an end pointer. (But if tuning for IvyBridge or Sandybridge, it would probably be worth using a pointer increment so the store could stay micro-fused. AMD can't fuse `sub`/`jcc`, though, so it's only on Haswell and later that this indexed addressing plus `sub`/`jg` is probably the best way, without unrolling.)
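A minimal sketch of that offsetting, reusing the vector loop from above (`arr` is a hypothetical label for the array start; the setup of `edi` is not shown, and I'm assuming it starts as a multiple of 16, with the remainder covered by the non-multiple-of-4 TODO):

```
    lea    esi, [arr - 16]    ; bias the base down by one whole vector:
                              ; now edi = 16 means the load starts exactly at arr
copy_search:
    movdqu xmm1, [esi+edi]
    ; ... compare / shift-store as above ...
    sub    edi, 16
    jg     copy_search        ; last iteration runs with edi = 16, so we never read below arr
;; fall through with up to 3 elements at the front of the array still unsearched
```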
This should avoid store-forwarding stalls, because we're writing memory after we've loaded from it. The overlap from the `+4` (1/4 of a vector) is into a previous read, not into the next read. Even for small sorts we shouldn't get store-forwarding stalls: the next outer iteration will restart the inner loop aligned with the position where we wrote it.
But the overhead of handling non-multiple-of-4 and not going past the start of the array will hurt for small sorts, because the actual vector loop is only running a few iterations.