Can I atomically increment a 16 bit counter on x86/x86_64?

Question

I want to save memory by converting an existing 32 bit counter to a 16 bit counter. This counter is atomically incremented/decremented. If I do this:

1. What instructions do I use for atomic_inc(uint16_t x) on x86/x86_64?
2. Is this reliable in multi-processor x86/x86_64 machines?
3. Is there a performance penalty to pay on any of these architectures for doing this?
4. If yes for (3), what's the expected performance penalty?

Thanks for your comments!

Unless you've got a lot of counters (and that's a lot as in "megabytes") that seems to be an awful lot of effort to save 2 bytes. What is the *actual* problem that you're trying to solve here? — Timo Geusch, Oct 09 '09 at 06:04
Yeah, I have a *lot* of these counters amounting in megabytes. Each such counter represents pending operations on a corresponding block of memory. When the counter goes down to zero, I am supposed to trigger another operation. — Sudhanshu, Oct 09 '09 at 06:23
Possible duplicate of [Can num++ be atomic for 'int num'?](https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num) — phuclv, Jan 14 '18 at 08:42

score 4 · Accepted Answer · edited Dec 07 '17 at 13:22

Here's one that uses GCC assembly extensions, as an alternative to Steve's Delphi answer:

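#include <stdint.h>

uint16_t atomic_inc(uint16_t volatile* ptr)
{
    uint16_t value(1);
    // xadd writes *ptr as well as value, so both are read-write ("+") operands.
    __asm__ __volatile__("lock xadd %w0, %1"
                         : "+r" (value), "+m" (*ptr)
                         : : "memory", "cc");
    return ++value;  // xadd left the old value in `value`; return the new one
}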

For decrement, replace the 1 with -1 and the ++ with --.
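
For a version that does not depend on inline assembly, a minimal sketch using C11 `<stdatomic.h>` might look like the following. The names `atomic_inc16` and `atomic_dec16` are illustrative and not part of the answer above, and the sketch assumes a C11 compiler. On x86, GCC and Clang typically compile these calls to a single `lock`-prefixed instruction, but that is up to the compiler.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically add 1 to *ptr and return the new value.
   The counter wraps from 65535 to 0, like any unsigned 16-bit value. */
uint16_t atomic_inc16(_Atomic uint16_t *ptr)
{
    /* atomic_fetch_add returns the value held before the addition. */
    return (uint16_t)(atomic_fetch_add(ptr, 1) + 1);
}

/* Hypothetical helper: atomically subtract 1 from *ptr and return the new value. */
uint16_t atomic_dec16(_Atomic uint16_t *ptr)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return (uint16_t)(atomic_fetch_sub(ptr, 1) - 1);
}
```

Both helpers use the default sequentially consistent ordering, which is the simplest choice to reason about; a weaker ordering may be enough for a plain counter, but that depends on what the surrounding code relies on.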

edited Dec 07 '17 at 13:22 by phuclv
answered Oct 09 '09 at 06:25 by Chris Jester-Young

Thanks, but I am really talking about AMD64/Pentiums here. :) – Sudhanshu Oct 09 '09 at 06:35
That's cool. I'm still giving you a C alternative in case you don't want to code with Delphi. :-) – Chris Jester-Young Oct 09 '09 at 06:36
(Well, change the initializer from `uint16_t value(1)` to `uint16_t value = 1` for C (my C++ habits are getting to me), but yeah. :-P) – Chris Jester-Young Oct 09 '09 at 06:38
3

`lock inc` / `setcc` will do the trick, because `inc` sets flags according to the result (i.e. the value it writes to memory). In GNU C, [gcc6 introduced an extension](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands) that allows returning flags, so it could inline to a `lock inc` / `jcc` instead of a `lock inc` / `setcc` / `test/jcc`. But without that, `lock xadd` is probably good: smaller code size and fewer insns than `lock inc` / `setcc`. – Peter Cordes Aug 18 '16 at 17:55
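
The "trigger another operation when the counter reaches zero" use case from the question can be sketched without inline assembly as a decrement-and-test. This is only an illustration under the assumption that C11 atomics are available; `atomic_dec_and_test16` is a made-up name. Whether the compiler emits `lock xadd` or a `lock dec`/`lock sub` followed by a flag test is its own choice.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: atomically decrement *ptr and report whether this
   particular call moved the counter from 1 to 0. Exactly one caller can
   observe the old value 1, so exactly one caller gets `true`. */
bool atomic_dec_and_test16(_Atomic uint16_t *ptr)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return atomic_fetch_sub(ptr, 1) == 1;
}
```

The caller that receives `true` is the one that should start the follow-up operation on the block.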

score 3 · Answer 2 · edited Jan 15 '18 at 04:32

Here is a Delphi function that works:

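function LockedInc( var Target :WORD ) :WORD;
asm
        mov     ecx, eax        // ECX := address of Target (passed in EAX)
        mov     ax, 1           // AX := amount to add
   Lock xadd    [ecx], ax       // atomically: AX := old Target, Target := old Target + 1
        Inc     eax             // AX := new value, returned to the caller
end;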

I guess you could convert it to whichever language you require.

edited Jan 15 '18 at 04:32 by phuclv
answered Oct 09 '09 at 06:13 by Steve

To clarify for non Delphi/BASM users, I would add that in this (32 bit) routine, a pointer to Target will be passed in in EAX, and that the return value of the function will be in AX. – PhiS Oct 09 '09 at 07:06
Are you intentionally leaving the high half of the return value = the high half of the input pointer, or is that a bug? You might want `mov ecx, 1` (not just CX) / `lock xadd [eax], cx` / `lea eax, [ecx+1]`. (That has a [partial-register stall on old Intel CPUs (Nehalem and earlier)](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers) from reading `ecx` after writing `cx`. Use separate movzx + inc instructions to avoid that if needed.) – Peter Cordes Jan 15 '18 at 15:45

score 0 · Answer 3 · answered Jul 28 '10 at 02:34

The simplest way to perform an atomic increase is as follows (this is inline ASM):

asm
  lock inc dword ptr Counter;
end;

where Counter is an integer variable. This increments Counter directly in its memory location.

I have tested this with brute force and it works 100%.

answered Jul 28 '10 at 02:34 by IamIC

1

The OP wants a 16-bit counter, but yes this will work with `word` instead of `dword`. BTW, the only way you could say something "works 100%" based on brute force testing (without also checking the manuals) is if you tested it on every current *and future* x86 CPU. Anything where the `lock` prefix doesn't fault is 100% guaranteed atomic (I think; at least for instructions where the `lock` prefix is documented to apply). But for example `movdqa [eax], xmm0` is atomic on some CPUs but not others, so testing on Core2 wouldn't reveal the problems on some multi-socket Opterons. – Peter Cordes Jan 15 '18 at 15:47

score -1 · Answer 4 · answered Oct 09 '09 at 09:13

To answer the other three questions:

2. Yes, this is reliable in a multiprocessor environment.
3. Yes, there is a performance penalty.
4. The "lock" prefix locks the buses, not only for the processor but also for any external hardware that may want to access the bus via DMA (mass storage, graphics, ...). So it is slow: typically ~100 clock cycles, and possibly more. But if you have "megabytes" of counters, chances are you will hit a cache miss, in which case you will wait ~100 clocks anyway (the memory access time), or several hundred in the case of a page miss, so the overhead from lock might not matter.

answered Oct 09 '09 at 09:13 by Gunther Piez

Thanks for your answer. It comes closest to what I was looking for after Chris's reply. – Sudhanshu Oct 12 '09 at 14:05
3

I believe the bus locking is a thing of days of yore. Current generation processors do cache line locking instead: that also ties in neatly with the MESI or MESI like protocol used for cache coherence. – terminus Jun 07 '10 at 11:28
2

Downvoted because there is no performance penalty, or maybe a trivial one even for a single atomic counter. With a big array of atomic counters, 16-bit operand size should be a big win. Byte and word operations are supported natively by x86 cache hardware, so I don't expect any problems at all if the words are 16-bit aligned. `lock inc word [mem]` should have basically identical performance to `lock inc dword [mem]` (ignore cache misses). – Peter Cordes Jan 15 '18 at 15:56
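
To make the size and alignment points concrete, here is a rough sketch of a table of 16-bit atomic counters with compile-time checks. The names `counter16_t`, `counter32_t`, `NUM_BLOCKS`, `pending`, `pending_bytes` and `bytes_saved` are all hypothetical, and the checks assume a C11 compiler targeting x86 or x86_64 where `uint16_t` is `unsigned short`.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef _Atomic uint16_t counter16_t;  /* hypothetical name */
typedef _Atomic uint32_t counter32_t;  /* hypothetical name */

/* Each counter should occupy exactly 2 bytes and be 2-byte aligned, so that
   no counter in the array straddles a cache-line boundary. */
_Static_assert(sizeof(counter16_t) == 2, "16-bit counter is not 2 bytes");
_Static_assert(_Alignof(counter16_t) == 2, "16-bit counter is not 2-byte aligned");
/* 2 means "always lock-free" for short-sized atomics on this target. */
_Static_assert(ATOMIC_SHORT_LOCK_FREE == 2, "16-bit atomics are not lock-free");

#define NUM_BLOCKS 1024  /* assumed table size, for illustration only */

/* One pending-operation counter per block; zero-initialized. */
static counter16_t pending[NUM_BLOCKS];

/* Bytes used by the 16-bit table. */
size_t pending_bytes(void)
{
    return sizeof pending;
}

/* Bytes saved compared with an equally long table of 32-bit counters. */
size_t bytes_saved(void)
{
    return NUM_BLOCKS * (sizeof(counter32_t) - sizeof(counter16_t));
}
```

One thing to keep in mind with this layout is that a 16-bit counter can only represent up to 65535 pending operations per block before it wraps.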
