Style: You could easily give your loop labels meaningful names, like `copy_search:` and `outer:`.
You have some major missed optimizations, so yes, there are some things that will help.
(Since this is homework, I'm just going to describe them, not actually implement them. As tempting as it would be to write my own implementation, I'd then have to debug it. The real uses would be pretty limited, though. State-of-the-art sorting on x86 for short arrays (e.g. as a base case for MergeSort or QuickSort) is a SIMD sorting network using SSE2 or AVX2 shuffles and packed min/max instructions. I think branchy sorting is usually not the best choice in x86 asm even for tiny arrays like 3 or 4 elements long: then you'd probably just want some scalar branchless code.)
The easiest way to get better asm would be a rewrite in C, compiled with optimization enabled. That's a good way in general to get ideas for how to optimize: even if you're an expert at asm optimization, the compiler might think of something you didn't. Using compiler output as the starting point for tuning is usually good.
The `halfpoint` thing is a puzzle for you to solve: how to do that cheaply without recomputing the array length inside the outer loop. Hint: Insertion Sort never looks at anything beyond the already-sorted region at the bottom plus the next element to be inserted, regardless of what final stop-condition you pass as a length. Hoisting that condition out of the loop entirely is probably the only algorithmic optimization; the rest is just implementing Insertion Sort more efficiently in asm.
(In the general case for loop-invariant conditions inside loops, you can make 2 versions of the loop and branch to select which one to run. But here we can just tweak the sort loop inputs before it starts.)
One of the most obvious optimizations is Why are loops always compiled into "do...while" style (tail jump)? - use `sub edi, 4` / `jnz copy_search_loop` at the bottom of your inner loop.
Another is to load the element to be compared and copied into a register, so you don't have to load it twice.
This will help a bit, but not a huge amount on modern x86. Repeated loads that hit in L1d cache are cheap-ish. Intel and AMD can both do 2 memory operations per clock, up to 1 of which is a store. You're using indexed addressing modes for stores, so Intel Haswell and later can't use the dedicated simple-store AGU on port 7, otherwise it could do 2 loads + 1 store per clock.
After fixing your inner loop to avoid a `jmp` (just one fall-through `jcc` and a `sub`/`jcc` at the bottom), your loop should only be 4 uops on Haswell and later (mov-load, macro-fused cmp/jcc, mov-store, macro-fused sub/jcc). Depending on how it decodes, one of the branches might not macro-fuse on Sandybridge. (SnB/IvB can't make 2 fusions in one decode group of up-to-4 uops that hit the decoders in the same cycle.) So with a register `cmp`/`jcc` instead of memory, the inner loop can maybe run at 1 iteration per clock (when it doesn't mispredict).
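A minimal sketch of what that could look like, under my assumptions (`eax` = key, matching the vector version below; `edx` = scratch; `esi` = array base; `edi` = byte index into the sorted region; the stop condition at the very front of the array is glossed over here, see the offsetting discussion further down):

```
copy_search:
    mov   edx, [esi+edi]      ; load arr[j] into a register once (mov-load)
    cmp   edx, eax            ; arr[j] <= key: found the insertion point
    jle   .insert_here        ; fall through on the common keep-searching path
    mov   [esi+edi+4], edx    ; shift arr[j] up one slot (mov-store)
    sub   edi, 4
    jnz   copy_search         ; macro-fused sub/jcc at the bottom: do...while style
.insert_here:
    mov   [esi+edi+4], eax    ; drop the key into the hole
```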
If you have a `cmp reg, 0` left over after optimizing to use flags set by `sub` or `dec`, optimize it into `test reg, reg`, which is 1 byte shorter but otherwise pretty much equal performance. (And it sets all flags identically except AF, which you can't branch on.)
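i.e. a straightforward peephole substitution:

```
    cmp   eax, 0              ; 3 bytes (83 F8 00)
    jnz   copy_search
;; becomes
    test  eax, eax            ; 2 bytes (85 C0)
    jnz   copy_search
```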
> It will most likely only ever have 20 elements in the array. Maybe up to 30. And it will run a few thousand times I believe w/ same array I think.
Ok, that means branch prediction might "learn" the pattern of branching for small sorts, since the function re-runs repeatedly on the same data.
That means tuning the asm might actually matter. With randomized data, most gains would be dwarfed by the number of cycles the CPU spends recovering from branch mispredictions. Of course, more efficient code lets it detect mispredictions sooner, and chew through the work until the next mispredict slightly faster. Modern CPUs are already very good at chewing through redundant instructions (as long as they don't increase critical-path latency much), because a lot of the machine code running on modern CPUs was created by JIT compilers in JVMs and similar, and isn't very well optimized.
For sorts of size 20 or 30, the copy-search loop will run multiple iterations quite a few times (unless the input is already close to sorted), so presumably its branches will usually predict correctly as continuing to loop. And optimizing it to do fewer loads, and run fewer instructions, for the still-searching case should actually help.
Remember, optimize your code for the common case. For loops, that usually means making the "keep looping" case fast. This means it's sometimes worth spilling something (storing it to memory) outside a loop to free up more registers for use inside the loop.
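A hypothetical example (the stack slot and the choice of `ecx` are my assumptions, not from the original code):

```
    mov   [esp+8], ecx        ; spill the loop-invariant element count to a free stack slot
outer:
    ; ... inner loop is free to use ecx as scratch here ...
    mov   ecx, [esp+8]        ; reload only where the outer loop actually needs it
    ; ... outer-loop bookkeeping ...
```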
Another way to save registers is to use pointer increments instead of base+index addressing. Or leave some read-only data in memory, especially if you only read it once per outer-loop iteration. e.g. the outer loop condition could be a `cmp [end_pointer], esi` / `jb`.
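For example, a hedged sketch of that structure, with `esi` as the advancing element pointer and `end_pointer` a hypothetical memory location holding one-past-the-end of the array (written here with the branch in the loop-continue direction):

```
outer:
    ; ... load the next key and run the copy-search inner loop ...
    add   esi, 4              ; pointer increment instead of base+index
    cmp   [end_pointer], esi  ; read-only end pointer stays in memory: one load per outer iteration
    ja    outer               ; keep looping while end_pointer > esi
```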
**For non-tiny arrays: vectorize the copy-search loop with SSE2 or AVX2**
(And maybe for small arrays, if you can make the overhead low enough, the lower limit being when most elements are inserted within 3 elements of the end of the sorted region. Or if the overhead is higher, a larger average copy size is needed to make it worth it.)
For larger sorts, Insertion Sort will spend a lot of time copying the array over 1 element at a time. i.e. the inner loop will often run many iterations. We can gain a lot by vectorizing it with SSE2 to copy-and-search 4 elements in parallel.
```
;; something like this; I haven't checked the logic carefully
    movd     xmm0, eax            ; key
    pshufd   xmm0, xmm0, 0        ; broadcast the key to all 4 elements: [key, key, key, key]

    ;; TODO: handle edi not a multiple of 4 somewhere

copy_search:
    movdqu   xmm1, [esi+edi]      ; load 16 bytes
    movdqa   xmm2, xmm0
    pcmpgtd  xmm2, xmm1           ; packed xmm2 = key > arr[i + 0..3]
    pmovmskb eax, xmm2
    test     eax, eax
    jnz      .found_element_less_than_key
             ;; figure out which element from the bitmap, and do something.
             ;; e.g. movd [mem], xmm0 to store the new element, because we destroyed EAX.
    movdqu   [esi+edi+4], xmm1    ; store after checking, because we might not want to move all 4
    sub      edi, 16              ; edi is the byte index; esi stays fixed as the (biased) base
    jg       copy_search
;; else fall through: key goes in one of the first 1 to 3 elements
;; handle the non-multiple-of-4 case here because it's rarely reached,
;; and doing it here instead of at the start avoids store-forwarding stalls for short counts
```
If the input array is 16-byte aligned, handling the non-multiple-of-4 case before entering the copy-search loop is tempting: then all the loads can be aligned. (The stores would still be misaligned, though, so maybe even handle it in a way that makes the stores aligned instead?) But modern CPUs have efficient unaligned-load handling, and cache-line splits aren't super expensive. Page splits (across a 4k boundary) are very expensive on Intel before Skylake (like ~100 extra cycles, vs. the same cost as a cache-line split on Skylake).
But unless you can do that very cheaply, avoiding it entirely most of the time (except when the copy-search loop reaches the very front of the array) is probably a win.
The loop termination condition can't be as simple as `i > 0`, because we need to avoid a decrement by 4 elements going off the start of the array. But by offsetting `esi` (the array base) we can still handle that with just a `sub`/`jcc` as the loop condition, not needing a `sub` and then a `cmp`/`jcc` against an end pointer. (But if tuning for IvyBridge or Sandybridge, it would probably be worth using a pointer increment so the store could stay micro-fused. AMD can't fuse `sub`/`jcc`, though, so it's only on Haswell and later that this indexed addressing plus `sub`/`jg` is probably the best way, without unrolling.)
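A minimal sketch of that offsetting, reusing the vector loop from above (`arr` is a hypothetical label for the array start; the setup of `edi` is not shown, and I'm assuming it starts as a multiple of 16, with the remainder covered by the non-multiple-of-4 TODO):

```
    lea    esi, [arr - 16]    ; bias the base down by one whole vector:
                              ; now edi = 16 means the load starts exactly at arr
copy_search:
    movdqu xmm1, [esi+edi]
    ; ... compare / shift-store as above ...
    sub    edi, 16
    jg     copy_search        ; last iteration runs with edi = 16, so we never read below arr
;; fall through with up to 3 elements at the front of the array still unsearched
```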
This should avoid store-forwarding stalls, because we're writing memory after we've loaded from it. The overlap from the `+4` (1/4 of a vector) is into a previous read, not into the next read. Even for small sorts we shouldn't get store-forwarding stalls: the next outer iteration will restart the inner loop aligned with the position where we wrote it.
But the overhead of handling non-multiple-of-4 and not going past the start of the array will hurt for small sorts, because the actual vector loop is only running a few iterations.