
A long time ago I did some hobby programming in Z80 and 68000 assembly, so I understand the basics, but I am a novice in x86/x64 assembly. I am trying to find the **fastest** code to do the following (the critical part needing optimal speed is steps 2-4 only).

I do not need help with steps 1 or 5 at all; they are shown for context only. I am not looking for complete code, but would appreciate any hints on optimal instructions and algorithms for this platform. There are many ways to write a routine like this, but often the obvious approach is not the optimal one. I would be fine if someone said something like "try using the XYZ instruction". Also, as I mention below, using an array in assembly might not be the fastest way to go, so any suggestions on how to structure the data optimally for speed are also part of the answer I am looking for. (Can x64 assembly even handle a 4GB array with an index?)

  • Step 1. Read a "string" of longword elements from a file. Each element contains externally supplied data that can be treated as a signed 31-bit number.

The file is usually pretty small (less than 100K) but may be as large as 1GB of elements at times. It is not necessary to read all elements into memory at the same time, as long as the individual elements can be accessed/modified directly. Instinctively this sounds like a 4GB array would be fastest, but I am a novice with x64 assembly and not sure if the overhead of an array would help or hurt the speed.

  • Step 2. Increment the element by 1.

  • Step 3. Check the sign flag (see if the increment set the high bit).

    • If set, then branch to a routine that will modify the element, then continue on to Step 4.
    • If not set, then jump to Step 5 (exit).

The time spent in the subroutine is outside the scope of this question; you can just use an immediate return instruction for now. The subroutine will, however, need to know the index of the element.

  • Step 4. Move on to the next element and repeat step 2.

  • Step 5. Close the file, saving any modified data.
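To make the loop I am describing concrete, here is the logic of steps 2-4 sketched in C (illustrative only; the names `scan` and `subroutine` are placeholders, and the subroutine is just an immediate return, as allowed above):

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder for the Step 3 routine: an immediate return for now.
 * It receives the index of the element whose increment set the high bit. */
static void subroutine(uint32_t *buf, size_t i)
{
    (void)buf;
    (void)i;
}

/* Steps 2-4: increment elements in sequence; if an increment leaves the
 * high bit set, call the subroutine and move on, otherwise exit (Step 5). */
static void scan(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        buf[i] += 1;                        /* Step 2: increment */
        if ((buf[i] & 0x80000000u) == 0)    /* Step 3: high (sign) bit set? */
            break;                          /* not set: go to Step 5 */
        subroutine(buf, i);                 /* set: modify the element */
    }                                       /* Step 4: next element */
}
```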

Also, two related questions:

  • Would the code run faster on a 32-bit system since the elements are 32 bits?

  • How would the code be different if Step 2 was an increment other than 1?



RESPONSE TO THE "TOO BROAD" CLOSURE FLAG:

How is this question "Too Broad" even though it precisely fits ALL FOUR of the on-topic descriptions at the top of the community guidelines on the SO Help Page:

  • a specific programming problem -- (how to optimize a special kind of array processing)
  • a software algorithm -- (the aforementioned array algorithm, specifically Steps 2-4 above)
  • software tools commonly used by programmers -- (x64 assembler is very commonly used by programmers)
  • a practical, answerable problem that is unique to software development -- (since several answers/suggestions were provided by @Jester, @PeterCordes, and @Ped7g I would say it is self-evident that this is the case)
    The way I read it, your implementation doesn't matter, you will be bottlenecked by I/O. – Jester May 08 '16 at 00:04
  • @Jester, not if the entire file is read into memory, but I am not sure if that is practical. I have never worked with a data structure this large before. – O.M.Y. May 08 '16 at 00:07
  • Even then it takes way more time to read the file and write it out than to process it. So much more, that the time spent processing doesn't matter. If read and write takes 1000 units of time, it doesn't really matter that you process it in 1 or 2 additional units. – Jester May 08 '16 at 00:08
  • Understood, **but I did say that I was only concerned with steps 2-4**. The I/O overhead is not what I am trying to optimize. In reality there will be other steps before step 2 and after step 4 (including possible returning to step 2 multiple times) but I was keeping things simple for discussion. – O.M.Y. May 08 '16 at 00:14
  • I am just saying optimizing the code will make no difference. Also your constraints pretty much make it impossible to vectorize so what you are left with is the naive code using `add` and `jnc`. You might be able to optimize slightly if you have control over the calling convention or if you can inline the subroutine. – Jester May 08 '16 at 00:32
  • Actually you say your numbers are signed, are you sure you then want to check the carry flag? – Jester May 08 '16 at 00:42
  • If, instead of using 32-bit ints, you decide to use 64-bit ints (native to the hardware anyway), what you will find, if the value of your numbers is halfway reasonable, is that adding one to the LSW is all you need to do. In practice, you don't even need to check the carry. For values less than 2^63, if you incremented the value continually it would take on the order of 2^63 ≈ 10^19 increments (roughly 10^10 seconds at one increment per nanosecond) before you got a carry out. There's way too much optimization being considered here. – Ira Baxter May 08 '16 at 01:04
  • @Jester, the subroutine could probably be inlined. Good idea. As for your other suggestion I am not sure I understand what you mean by "control the calling convention"? – O.M.Y. May 08 '16 at 01:33
  • @IraBaxter, I said the elements contain data that can be *treated* as 31-bit signed numbers, but maybe my terminology/understanding is bad. The idea I was thinking of was that if a single increment sets the high-bit of the longword I want to trigger the subroutine. – O.M.Y. May 08 '16 at 01:38
  • @IraBaxter, The 64-bit int could be used, but it would mean additional I/O overhead either in more storage space or in conversion to/from 32 bits in the file. That is partly why I asked my first "related" question in my post. – O.M.Y. May 08 '16 at 01:43
  • The most significant bit is actually the sign bit, so you should check the sign flag, not the carry flag. As for the calling convention, that applies if you insist on using a call, and can help you in passing and preserving the index register. Since the normal calling convention allows the argument register to be destroyed, you'd have to save it elsewhere. This optimization would only save you a single `mov` instruction, which is practically free anyway. – Jester May 08 '16 at 01:43
  • @Jester, ah so I was wrong on my terminology. Editing the question. Thanks. – O.M.Y. May 08 '16 at 01:45
  • Believe me, reading 64 bits from the file will not be distinguishable timewise from reading 32 bits from file. – Ira Baxter May 08 '16 at 03:26
  • Is step 3 really "see if the increment set the high bit", or is it "see if the high bit is set"? Because the first is true only when incrementing 0x7FFFFFFF; the second is true for source values 0x7FFFFFFF to 0xFFFFFFFE. Also, what "overhead for an array"? Memory in asm is naturally like an array. Actually a "signed 31-bit number" is any of 0x0..0x7FFFFFFF, with the sign in bit 30 (not bit 31, as in an ordinary C `int`, which is a 32-bit number, signed or unsigned depending on how you treat it). Which means you would have to test bit 30; INC+JS will not work for you. – Ped7g May 09 '16 at 17:34
  • "Would the code run faster on a 32-bit system since the elements are 32 bits?" ... no, I don't think so; the 64b system has more raw horsepower. Anyway, increasing 1GB of numbers with `inc`, if you go sequentially and you have those numbers in memory, is so fast that you are within tens of milliseconds. Just reading those data from disc/network and writing them back will take well over 1 sec, so who cares. There's nothing to gain in steps 2-4. If you think otherwise, write it, profile it, rewrite, profile again... good luck wasting your time in the wrong spot. – Ped7g May 09 '16 at 17:39
  • Well, besides making it abundantly clear in my question that I was not concerned with file I/O speed (steps 1 & 5) perhaps everyone also missed/ignored the comment above where I mentioned *"returning to step 2 multiple times"*. The 1G of 32-bit elements might be sequentially scanned/adjusted & rescanned/readjusted as many as **100 million** times from a single file load and several million files may be loaded overall. Perhaps that helps everyone see why I am so concerned with shaving 10ms per scan. There are a lot more scans than loads. – O.M.Y. May 10 '16 at 05:43
  • Your original description does not allow for multiple passes through step 2 (for a particular value). Fix your algorithm description first; actually putting the full algorithm with that subroutine here would allow us to ponder seriously how to get the same result in a more efficient way. If you are so sure about the rest of the app being optimal (and I have factual reasons to have some doubts about that), then just use the profiler. I will try to explain it another way. – Ped7g May 10 '16 at 08:10
  • This is not Z80 (sadly, I love it). On a top x64 CPU, up to some basic instruction count, you get execution completely for free (over a block of data), because disc/memory/cache I/O is so slow that for whatever basic calculation (like `inc`) you do there, the overall performance is bandwidth-limited. So you can easily change inc to add, or do inc+inc+dec, with very likely either the very same performance, or a performance change coming not from instruction count but from breaking some cache size/alignment of the code. So I would be much more worried about "branch to a routine". – Ped7g May 10 '16 at 08:14
  • @Ped7g. Not multiple Step 2 in a row, but rather multiple restarting the entire scan **at** Step 2 triggered after Step 4 and before Step 5. I tried to keep the description of the problem as narrow as possible, seeking only suggestions for strategies to optimize steps 2-4 as a unit. What happens outside of these three steps is not part of my question, as I have repeatedly stated. – O.M.Y. May 10 '16 at 08:20
  • @Ped7g ... Sorry, I didn't mean to sound rude, but between having to repeat myself and now the "Too Broad" closure flag above I am getting sick of this whole SO process. I'm not mad at you but I am annoyed at people insulting my ability to accurately describe what I need. Another user has also gone so far as to call me a liar [here](http://meta.stackoverflow.com/questions/322751/what-to-do-when-the-question-answered-is-not-the-question-asked) and it ticked me off I'm afraid. – O.M.Y. May 10 '16 at 08:23
  • @OMY: I see, sorry to sound rude, but you look to be clueless in this case. You either failed to describe your problem properly, or you are. If you are already this far, simply writing a few variants and using a profiler would either show you that there is no real difference in whichever way you increment the element (memory I/O limited), or it would let you back your problem with real data about the performance of the different variants. So far I guess the `add [rsi],1 / jns exit / call routine / add rsi,4 / jmp` is as good as the wasteful `inc [rsi] / dec [rsi] / inc [rsi]` variant. – Ped7g May 10 '16 at 11:42

1 Answer

See the x86 tag wiki for info about lots of stuff.

Why do you want to do all the I/O before writing back? Does your unspecified subroutine need random-access to arbitrary elements? If so, mmap(2) your whole file. (Assuming POSIX system calls).

If not, read(2) into a buffer that's maybe 128kB or so (smaller than L2 cache). Process that, then pwrite(2) it back to the place you read it from (or lseek(2)/write(2)).
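That read/process/write-back loop might look like this in C (a sketch assuming POSIX; `scan_file`, `process_chunk`, and `CHUNK_BYTES` are illustrative names, and pread(2)/pwrite(2) are used instead of read/lseek/write so the file offset bookkeeping is explicit):

```c
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

#define CHUNK_BYTES (128 * 1024)   /* smaller than typical L2 cache */

/* Stand-in for steps 2-4 on one chunk: here it just increments every
 * element. Returns nonzero to keep scanning, 0 to stop early. */
static int process_chunk(uint32_t *elems, size_t count)
{
    for (size_t i = 0; i < count; i++)
        elems[i] += 1;
    return 1;
}

/* Read the file in L2-sized chunks, process each in place, then write
 * the modified chunk back to the offset it came from. Returns 0 on
 * success, -1 on I/O error. */
static int scan_file(int fd)
{
    static uint32_t buf[CHUNK_BYTES / sizeof(uint32_t)];
    off_t off = 0;
    for (;;) {
        ssize_t got = pread(fd, buf, CHUNK_BYTES, off);
        if (got <= 0)
            return (int)got;            /* 0 = EOF, -1 = error */
        int keep_going = process_chunk(buf, (size_t)got / sizeof(uint32_t));
        if (pwrite(fd, buf, (size_t)got, off) != got)
            return -1;
        if (!keep_going)
            return 0;
        off += got;
    }
}
```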

Does your subroutine maybe abort the whole process, resulting in no modification of the file?


You could use SSE2 for the increment: PADDD does four 32-bit (dword) adds in parallel, and MOVMSKPS then extracts the sign bits. Use `test` on the mask result to see if any of the elements had their sign bits set. If so, call the subroutine for those elements.

bsf will find the first set bit, and blsr (BMI1) clears the lowest set bit. You might use those to loop over the set bits in the mask.
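That bit-scan loop can be sketched in C (a sketch only: `collect_set_bits` is an illustrative name; `__builtin_ctz` typically compiles to bsf/tzcnt, and `mask & (mask - 1)` is exactly what blsr computes):

```c
#include <stdint.h>

/* Write the index of every set bit of a movmskps-style mask into out[],
 * lowest bit first, and return how many bits were set. */
static int collect_set_bits(unsigned mask, int *out)
{
    int n = 0;
    while (mask) {
        out[n++] = __builtin_ctz(mask);  /* bsf/tzcnt: index of lowest set bit */
        mask &= mask - 1;                /* blsr: clear the lowest set bit */
    }
    return n;
}
```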

Or if you find that any elements in a vector had their sign bit set, you could skip the vector store back into the array and instead go through those elements again with scalar code. This has the advantage of having the neighbouring elements in the array in the "correct" state when the subroutine is called.

e.g.

    ;; set up constants
    pcmpeqw   xmm1, xmm1
    psrld     xmm1, 31     ; xmm1 = [ 1 1 1 1]

    ; rsi = start,  rdi = one-past-the-end
    ; or maybe prefer keeping these in regs the subroutine won't clobber
.vectorloop:
    movdqa    xmm0, [rsi]
    paddd     xmm0, xmm1
    movmskps  eax, xmm0       ; sign bit of each dword element.  pmovmskb would give us the high bit of every byte
    test      eax, eax
    jnz     .at_least_one_sign_bit_set
    movdqa    [rsi], xmm0     ; vector store back, since no elements had sign bits set
.resume_vectorloop:    ; scalar code jumps back here when done
    add       rsi, 16
    cmp       rsi, rdi
    jb       .vectorloop

    jmp     all_done

.at_least_one_sign_bit_set:

    ; Array isn't modified at this point.
    inc     dword [rsi]                ;; or better, load / inc / jns, passing the pointer and index to the subroutine, so it doesn't have to load again after the read-modify-write inc.
    jns ...
        ;; maybe add rsi, 4  here, depending on how we want to call the subroutine.
    inc     dword [rsi+4]
    jns ...
    ...
    jmp      .resume_vectorloop    ;; or duplicate the tail and cmp/jb to .vectorloop

This assumes your buffer is aligned and the size a multiple of the vector width, so you don't have to care about unaligned or scalar cleanup. You control the buffer, so this should be easy. (except for the length part with mmap, potentially. But it's not a hard problem to solve.)
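For comparison, roughly the same structure in C with SSE2 intrinsics (a sketch: `vector_scan` and `scalar_block` are illustrative names, `subroutine` is an immediate-return stand-in as the question allows, the exit path the asm leaves as `jns ...` is omitted here too, and as the comments note this fast path only pays off when sign bits are rare):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Immediate-return stand-in for the subroutine, as in the question. */
static void subroutine(uint32_t *buf, size_t i)
{
    (void)buf;
    (void)i;
}

/* Redo one 4-element block in scalar code after the vector pass saw a
 * sign bit, calling the subroutine for each element whose increment
 * leaves the sign bit set. */
static void scalar_block(uint32_t *p, size_t base)
{
    for (size_t i = 0; i < 4; i++) {
        p[i] += 1;                          /* inc dword [rsi + 4*i] */
        if (p[i] & 0x80000000u)
            subroutine(p, base + i);
    }
}

/* buf must be 16-byte aligned and n a multiple of 4, as assumed above. */
static void vector_scan(uint32_t *buf, size_t n)
{
    const __m128i ones = _mm_set1_epi32(1);
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        v = _mm_add_epi32(v, ones);                        /* paddd    */
        int mask = _mm_movemask_ps(_mm_castsi128_ps(v));   /* movmskps */
        if (mask == 0)
            _mm_store_si128((__m128i *)(buf + i), v);  /* no sign bits set */
        else
            scalar_block(buf + i, i);  /* redo scalar so neighbours are correct */
    }
}
```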

Peter Cordes
  • Wow Peter, That is a lot to digest. Some of it sounds intriguing and I will need to investigate these instructions to see which might be helpful to my goals. In the meantime I will try and answer some of your questions in the next few comments... – O.M.Y. May 08 '16 at 04:44
  • "Why all the I/O?" I need to examine elements and act on them in the sequence they are stored in the file (what happens with early elements can affect later ones), but only change some of the elements based on various criteria. Also, in answer to your other question I do need random access rather than having to rewrite elements that do not get changed. – O.M.Y. May 08 '16 at 04:49
  • Can you point me to an example of the vector/scalar approach you are describing? I'm afraid I don't quite grasp that concept. – O.M.Y. May 08 '16 at 04:51
  • Once an element fails to trigger the sign flag all remaining elements in sequence are skipped for that file. However the subroutine will never abort without modifying at least some elements. The SSE2 approach sounds very promising. I will definitely be looking into that concept first. – O.M.Y. May 08 '16 at 04:56
  • @O.M.Y.: Oh, so calling the subroutine is the common case? Why the hell are you caring so much about the increment part? The vector approach only helps if elements rarely trigger the subroutine! Maybe Jester read your question more carefully, but he's right, this part doesn't vectorize without vectorizing the subroutine as part of it. I did add code with an edit for the case where not calling the subroutine is the fast path. – Peter Cordes May 08 '16 at 05:14
  • @O.M.Y.: My last comment was ruder than I meant to be. You wouldn't have asked if it was obvious to you. Anyway, don't worry about the incrementing, just `inc` or `add` and break out of the loop on the `ns` (not sign) condition. – Peter Cordes May 08 '16 at 05:41
  • No offense taken Peter, this problem is strange, I know. **It's like the old adage: "take care of the pennies and the dollars will take care of themselves."** Yes, calling the subroutine is the common case, and since the heartbeat of the process depends on doing a simple "+1 & check flag" *in sequence* on hundreds of thousands to hundreds of millions of elements, I am trying to find every little way to save those pennies of time. – O.M.Y. May 08 '16 at 11:56
  • @O.M.Y. "Penny wise and pound foolish" is the old adage I'd use here. You're trying to shave off a cycle or two per longword processed when the other steps of your processing will cost thousands of cycles or more per longword. To put it another way, a 1 GB file has 250M longwords; saving 1 cycle per longword on a 2 GHz CPU would result in saving a total of 125 ms. You wouldn't even notice the difference. Compare that to the time you've wasted on this question. How many times would your program need to run before your hoped-for savings paid back what you've spent on this so far? – Ross Ridge May 08 '16 at 16:12
  • @O.M.Y. "Take care of the pennies and the dollars will take care of themselves" is not an adage that applies to software. You can spend as much time as you like optimizing an inefficient algorithm and it will still be orders of magnitude slower than a naive unoptimized implementation of a better algorithm. – meager May 08 '16 at 19:07
  • I appreciate the advice @meagar but the process is what it is. It has been developed carefully but in a high-level language that is not efficient enough. The elements must be analyzed in sequence and reacted to in sequence. The subroutine must be accessed if the increment says so. There is no option for this. The best I can hope for is to shave whatever time I can. Jester's suggestion to inline the subroutine is one of those ways. Can you think of any others? – O.M.Y. May 08 '16 at 19:45
  • @O.M.Y.: Careful tuning of how you do your I/O is going to matter most, since that's probably going to be the bottleneck unless the subroutine is slow. mmapping the whole file leads to more TLB misses, and actually has more overhead than the copying overhead of reading in small chunks that fit in L2 cache. Unless it makes the subroutine less efficient to deal with the case where the part of the array it needs to modify isn't loaded yet, you might well do better with read/write rather than mmap. – Peter Cordes May 08 '16 at 20:15
  • @O.M.Y.: see for example http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks, and maybe http://stackoverflow.com/questions/8056984/speeding-up-file-i-o-mmap-vs-read. – Peter Cordes May 08 '16 at 20:22
  • I did say at the start I was not worried about Steps 1 or 5 but this is good advice @PeterCordes. There actually are some variables that can arise early in the file which sometimes allow a *degree* of prediction as to how much of the file *might* need to be loaded. My prototype code does use this to minimize the I/O activity by preloading a certain number of blocks of the file. Tuning these preloads to a ratio of the L2 size is a great idea! – O.M.Y. May 08 '16 at 21:08