How to implement atoi using SIMD?

Question

I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.

If it's a speed gain, multiple atoi results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.

The algorithm I came up with is the following:

I can initialize a vector of length N in the following fashion: [10^N..10^1]
I convert each character in the buffer to an integer and place them in another vector.
I take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results.

I'm targeting x86 and x86-64 architectures.

I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?

If you are trying to do this with x86 SIMD instructions, I recommend you to tag this as [tag:assembly] and [tag:x86] so the people that read the corresponding tag queues see your post. — fuz, Feb 01 '16 at 09:57
Related SSE string-parsing question with some useful techniques: http://stackoverflow.com/a/31683632/224132 (packed-compare -> shuffle mask lookup). That might not be needed here, since you only need the find the end of one string. — Peter Cordes, Feb 01 '16 at 11:04
@FUZxxl Most questions I've seen tag SIMD alongsides C since that's what they are using to implement SIMD operations with. — the_drow, Feb 01 '16 at 11:13
@the_drow: Are you planning to target C with SSE intrinsics, or write a whole function in asm (e.g. with Intel syntax (NASM/YASM) or AT&T syntax (gcc-style))? Those are your two good options. Either way, see the links at http://stackoverflow.com/tags/x86/info. Inline ASM is the 3rd option, but it's a bad choice. Also, I noticed that stgatilov's SSE IPv4 address parser I linked does have some ideas for 1-3 digit strings. — Peter Cordes, Feb 01 '16 at 13:39
I'm targeting C with SSE intrinsics which is why the question was tagged as C in the beginning. That's what RapidJSON uses. — the_drow, Feb 01 '16 at 13:41
@the_drow: I read ["Each JSON value occupies exactly 16/20 bytes for most 32/64-bit machines (excluding text string)."](http://rapidjson.org/md_doc_features.html). That doesn't apply to this case, does it? That would be the space to store the integer itself + metadata after it's parsed? — Peter Cordes, Feb 02 '16 at 13:02
BTW, a quick google for SIMD atoi turned up a few hits: https://software.intel.com/en-us/forums/intel-c-compiler/topic/296952 talks about the same stuff that the answers and comments here have said, mostly. (not as much detail as zx485's answer though). There's also this http://www.mersenneforum.org/showthread.php?t=11590, where a couple people were throwing around some actual code. They're talking about using `double` to handle the math for the full range of 32bit integers. One early post has an apparently full `atoi` that he says takes 70 cycles on core2. — Peter Cordes, Feb 02 '16 at 13:11
I have completed the implementation and tested it. The maximum needed SSE level is 4.2. It could be converted to inline assembly for C pretty easily, I suppose. The updated answer is below. — zx485, Feb 02 '16 at 19:39
Thanks for accepting my answer. I just want to add that I improved the implementation by reducing the constant memory usage to `64 BYTE = One Cache Line` and only sacrificing 2 more cycles. Additionally I added handling for the `'+'` char at the beginning. Hope you like it. — zx485, Feb 02 '16 at 22:55
@zx485: probably the easiest way to integrate this into RapidJSON is to code it translate it into intrinsics, rather than asm. If the compiler does a decent job with it, then that's much easier for portability, and can inline. Otherwise building RapidJSON would require the GNU assembler (or NASM/YASM), which is a problem if portability to MSVC is desired. (It's only a speedup, not an essential part, but still.) I find asm easier to write than intrinsics, too (much easier to type, and you never have to fight the compiler), but intrinsics are the way to go for portability. — Peter Cordes, Feb 03 '16 at 13:29

score 26 · Accepted Answer · edited Nov 12 '17 at 12:36

The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).

The properties of this code are as follows:

Works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295
A leading '-' char indicates a negative number (as sensible), a leading '+' char is ignored
Leading zeros (with or without sign char) are ignored
Overflow is ignored - bigger numbers just wraparound
Zero length strings result in value 0 = -0
Invalid characters are recognized and the conversion ends at the first invalid char
At least 16 bytes after the last leading zero must be accessible and possible security implications of reading after EOS are left to the caller
Only SSE4.2 is needed

About this implementation:

This code sample can be run with GNU Assembler(as) using .intel_syntax noprefix at the beginning
Data footprint for constants is 64 bytes (4*128 bit XMM) equalling one cache line.
Code footprint is 46 instructions with 51 micro-Ops and 64 cycles latency
One loop for removal of leading zeros, otherwise no jumps except for error handling, so...
Time complexity is O(1)

The approach of the algorithm:

- Pointer to number string is expected in ESI
- Check if first char is '-', then indicate if negative number in EDX (**A**)
- Check for leading zeros and EOS (**B**)
- Check string for valid digits and get strlen() of valid chars (**C**)
- Reverse string so that power of 
  10^0 is always at BYTE 15
  10^1 is always at BYTE 14
  10^2 is always at BYTE 13
  10^3 is always at BYTE 12
  10^4 is always at BYTE 11 
  ... 
  and mask out all following chars (**D**)
- Subtract saturated '0' from each of the 16 possible chars (**1**)
- Take 16 consecutive byte-values and and split them to WORDs 
  in two XMM-registers (**2**)
  P O N M L K J I  | H G F E D C B A ->
    H   G   F   E  |   D   C   B   A (XMM0)
    P   O   N   M  |   L   K   J   I (XMM1)
- Multiply each WORD by its place-value modulo 10000 (1,10,100,1000)
  (factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM)
  (**2**) so we can horizontally combine twice before another multiply.
  The PMADDWD instruction can do this and the next step:
- Horizontally add adjacent WORDs to DWORDs (**3**)
  H*1000+G*100  F*10+E*1  |  D*1000+C*100  B*10+A*1 (XMM0)
  P*1000+O*100  N*10+M*1  |  L*1000+K*100  J*10+I*1 (XMM1)
- Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**)
  xmmDst[31-0]   = xmm0[63-32]  + xmm0[31-0]
  xmmDst[63-32]  = xmm0[127-96] + xmm0[95-64]
  xmmDst[95-64]  = xmm1[63-32]  + xmm1[31-0]
  xmmDst[127-96] = xmm1[127-96] + xmm1[95-64]
- Values in XMM0 are multiplied with the factors (**5**)
  P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings)
  L*1000+K*100+J*10+I*1 (DWORD factor 100000000)
  H*1000+G*100+F*10+E*1 (DWORD factor 10000)
  D*1000+C*100+B*10+A*1 (DWORD factor 1)
- The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD)
  - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**6**)
    xmm0[63-32] = xmm0[127-96] + xmm0[95-64]
      (the upper QWORD contains the same and is ignored)
  - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**7**)
- If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)

And the sample implementation in GNU Assembler with intel syntax:

.intel_syntax noprefix
.data
  .align 64
    ddqDigitRange: .byte  '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0
    ddqShuffleMask:.byte  15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 
    ddqFactor1:    .word  1,10,100,1000, 1,10,100,1000  
    ddqFactor2:    .long  1,10000,100000000,0
.text    
_start:
   mov   esi, lpInputNumberString
   /* (**A**) indicate negative number in EDX */
   mov   eax, -1
   xor   ecx, ecx
   xor   edx, edx
   mov   bl,  byte ptr [esi]
   cmp   bl,  '-'
   cmove edx, eax
   cmp   bl,  '+'
   cmove ecx, eax
   sub   esi, edx
   sub   esi, ecx
   /* (**B**)remove leading zeros */
   xor   eax,eax               /* return value ZERO */
  remove_leading_zeros:
   inc   esi
   cmp   byte ptr [esi-1], '0'  /* skip leading zeros */
  je remove_leading_zeros
   cmp   byte ptr [esi-1], 0    /* catch empty string/number */
  je FINISH
   dec   esi
   /* check for valid digit-chars and invert from front to back */
   pxor      xmm2, xmm2         
   movdqa    xmm0, xmmword ptr [ddqDigitRange]
   movdqu    xmm1, xmmword ptr [esi]
   pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) iim8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */
  jo FINISH             /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */
   mov al , '0'         /* value to subtract from chars */
   sub ecx, 16          /* len-16=negative to zero for shuffle mask */
   movd      xmm0, ecx
   pshufb    xmm0, xmm2 /* broadcast CL to all 16 BYTEs */
   paddb     xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */
   pshufb    xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */
   movd      xmm0, eax                         /* AL='0' from above */
   pshufb    xmm0, xmm2                        /* broadcast AL to XMM0 */
   psubusb   xmm1, xmm0                        /* (**1**) */
   movdqa    xmm0, xmm1
   punpcklbw xmm0, xmm2                        /* (**2**) */
   punpckhbw xmm1, xmm2
   pmaddwd   xmm0, xmmword ptr [ddqFactor1]    /* (**3**) */
   pmaddwd   xmm1, xmmword ptr [ddqFactor1]
   phaddd    xmm0, xmm1                        /* (**4**) */
   pmulld    xmm0, xmmword ptr [ddqFactor2]    /* (**5**) */
   pshufd    xmm1, xmm0, 0b11101110            /* (**6**) */
   paddd     xmm0, xmm1
   pshufd    xmm1, xmm0, 0b01010101            /* (**7**) */
   paddd     xmm0, xmm1
   movd      eax, xmm0
   /* negate if negative number */              
   add       eax, edx                          /* (**8**) */
   xor       eax, edx
  FINISH:
   /* EAX is return (u)int value */

The result of Intel-IACA Throughput Analysis for Haswell 32-bit:

Throughput Analysis Report
--------------------------
Block Throughput: 16.10 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 9.5    0.0  | 10.0 | 4.5    4.5  | 4.5    4.5  | 0.0  | 11.1 | 11.4 | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
|   0*   |           |     |           |           |     |     |     |     |    | xor ecx, ecx
|   0*   |           |     |           |           |     |     |     |     |    | xor edx, edx
|   1    |           | 0.1 |           |           |     |     | 0.9 |     |    | dec eax
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov bl, byte ptr [esi]
|   1    |           |     |           |           |     |     | 1.0 |     | CP | cmp bl, 0x2d
|   2    | 0.1       | 0.2 |           |           |     |     | 1.8 |     | CP | cmovz edx, eax
|   1    | 0.1       | 0.5 |           |           |     |     | 0.4 |     | CP | cmp bl, 0x2b
|   2    | 0.5       | 0.2 |           |           |     |     | 1.2 |     | CP | cmovz ecx, eax
|   1    | 0.2       | 0.5 |           |           |     |     | 0.2 |     | CP | sub esi, edx
|   1    | 0.2       | 0.5 |           |           |     |     | 0.3 |     | CP | sub esi, ecx
|   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
|   1    | 0.3       | 0.1 |           |           |     |     | 0.6 |     | CP | inc esi
|   2^   | 0.3       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.6 |     |    | cmp byte ptr [esi-0x1], 0x30
|   0F   |           |     |           |           |     |     |     |     |    | jz 0xfffffffb
|   2^   | 0.6       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.4 |     |    | cmp byte ptr [esi-0x1], 0x0
|   0F   |           |     |           |           |     |     |     |     |    | jz 0x8b
|   1    | 0.1       | 0.9 |           |           |     |     |     |     | CP | dec esi
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqa xmm0, xmmword ptr [0x80492f0]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | movdqu xmm1, xmmword ptr [esi]
|   0*   |           |     |           |           |     |     |     |     |    | pxor xmm2, xmm2
|   3    | 2.0       | 1.0 |           |           |     |     |     |     | CP | pcmpistri xmm0, xmm1, 0x14
|   1    |           |     |           |           |     |     | 1.0 |     |    | jo 0x6e
|   1    |           | 0.4 |           |           |     | 0.1 | 0.5 |     |    | mov al, 0x30
|   1    | 0.1       | 0.5 |           |           |     | 0.1 | 0.3 |     | CP | sub ecx, 0x10
|   1    |           |     |           |           |     | 1.0 |     |     | CP | movd xmm0, ecx
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm0, xmm2
|   2^   |           | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | paddb xmm0, xmmword ptr [0x80492c0]
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm1, xmm0
|   1    |           |     |           |           |     | 1.0 |     |     |    | movd xmm0, eax
|   1    |           |     |           |           |     | 1.0 |     |     |    | pshufb xmm0, xmm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | psubusb xmm1, xmm0
|   0*   |           |     |           |           |     |     |     |     | CP | movdqa xmm0, xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | punpcklbw xmm0, xmm2
|   1    |           |     |           |           |     | 1.0 |     |     |    | punpckhbw xmm1, xmm2
|   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
|   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
|   3    |           | 1.0 |           |           |     | 2.0 |     |     | CP | phaddd xmm0, xmm1
|   3^   | 2.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmulld xmm0, xmmword ptr [0x80492e0]
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xee
|   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0x55
|   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
|   1    | 1.0       |     |           |           |     |     |     |     | CP | movd eax, xmm0
|   1    |           |     |           |           |     |     | 1.0 |     | CP | add eax, edx
|   1    |           |     |           |           |     |     | 1.0 |     | CP | xor eax, edx
Total Num Of Uops: 51

The result of Intel-IACA Latency Analysis for Haswell 32-bit:

Latency Analysis Report
---------------------------
Latency: 64 Cycles

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

The Resource delay is counted since all the sources of the instructions are ready
and until the needed resource becomes available

| Inst |                 Resource Delay In Cycles                  |    |
| Num  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  | FE |    |
-------------------------------------------------------------------------
|  0   |         |    |         |         |    |    |    |    |    |    | xor eax, eax
|  1   |         |    |         |         |    |    |    |    |    |    | xor ecx, ecx
|  2   |         |    |         |         |    |    |    |    |    |    | xor edx, edx
|  3   |         |    |         |         |    |    |    |    |    |    | dec eax
|  4   |         |    |         |         |    |    |    |    | 1  | CP | mov bl, byte ptr [esi]
|  5   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2d
|  6   |         |    |         |         |    |    |    |    |    | CP | cmovz edx, eax
|  7   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2b
|  8   |         |    |         |         |    |    | 1  |    |    | CP | cmovz ecx, eax
|  9   |         |    |         |         |    |    |    |    |    | CP | sub esi, edx
| 10   |         |    |         |         |    |    |    |    |    | CP | sub esi, ecx
| 11   |         |    |         |         |    |    |    |    | 3  |    | xor eax, eax
| 12   |         |    |         |         |    |    |    |    |    | CP | inc esi
| 13   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x30
| 14   |         |    |         |         |    |    |    |    |    |    | jz 0xfffffffb
| 15   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x0
| 16   |         |    |         |         |    |    |    |    |    |    | jz 0x8b
| 17   |         |    |         |         |    |    |    |    |    | CP | dec esi
| 18   |         |    |         |         |    |    |    |    | 4  |    | movdqa xmm0, xmmword ptr [0x80492f0]
| 19   |         |    |         |         |    |    |    |    |    | CP | movdqu xmm1, xmmword ptr [esi]
| 20   |         |    |         |         |    |    |    |    | 5  |    | pxor xmm2, xmm2
| 21   |         |    |         |         |    |    |    |    |    | CP | pcmpistri xmm0, xmm1, 0x14
| 22   |         |    |         |         |    |    |    |    |    |    | jo 0x6e
| 23   |         |    |         |         |    |    |    |    | 6  |    | mov al, 0x30
| 24   |         |    |         |         |    |    |    |    |    | CP | sub ecx, 0x10
| 25   |         |    |         |         |    |    |    |    |    | CP | movd xmm0, ecx
| 26   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm0, xmm2
| 27   |         |    |         |         |    |    |    |    | 7  | CP | paddb xmm0, xmmword ptr [0x80492c0]
| 28   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm1, xmm0
| 29   |         |    |         |         |    | 1  |    |    |    |    | movd xmm0, eax
| 30   |         |    |         |         |    | 1  |    |    |    |    | pshufb xmm0, xmm2
| 31   |         |    |         |         |    |    |    |    |    | CP | psubusb xmm1, xmm0
| 32   |         |    |         |         |    |    |    |    |    | CP | movdqa xmm0, xmm1
| 33   |         |    |         |         |    |    |    |    |    | CP | punpcklbw xmm0, xmm2
| 34   |         |    |         |         |    |    |    |    |    |    | punpckhbw xmm1, xmm2
| 35   |         |    |         |         |    |    |    |    | 9  | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
| 36   |         |    |         |         |    |    |    |    | 9  |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
| 37   |         |    |         |         |    |    |    |    |    | CP | phaddd xmm0, xmm1
| 38   |         |    |         |         |    |    |    |    | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0]
| 39   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0xee
| 40   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
| 41   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0x55
| 42   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
| 43   |         |    |         |         |    |    |    |    |    | CP | movd eax, xmm0
| 44   |         |    |         |         |    |    |    |    |    | CP | add eax, edx
| 45   |         |    |         |         |    |    |    |    |    | CP | xor eax, edx

Resource Conflict on Critical Paths: 
-----------------------------------------------------------------
|  Port  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  |
-----------------------------------------------------------------
| Cycles | 0    0  | 0  | 0    0  | 0    0  | 0  | 0  | 1  | 0  |
-----------------------------------------------------------------

List Of Delays On Critical Paths
-------------------------------
6 --> 8 1 Cycles Delay On Port6

An alternative handling suggested in comments by Peter Cordes is replacing the last two add+xor instructions by an imul. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a ! - instruction not supported, was not accounted in Analysis comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).

The alternative may be replacing the last two lines with:

  ...
  /* negate if negative number */              
   imul eax, edx
  FINISH:
  /* EAX is return (u)int value */

"each WORD in the two QWORDs"?? A word in x86-terminology is 16b. An xmm register holds a double-quad word. Just say register if you're talking about a reg rather than an element. You're never actually doing anything with 64b element size, so don't say anything about qwords. I like your idea of delaying multiply by larger powers of 10 until after combining up to DWORD elements! — Peter Cordes, Feb 01 '16 at 14:20
@Peter Cordes: I'm not sure what you mean by that. An XMM register is 128-bit wide, a QWORD is 64-bit wide, a DWORD is 32-bit wide, a WORD is 16-bit wide and a BYTE is 8-bit wide. Therefore a 128-bit register could be considered as containing two 64-bit values(QWORDs). I chose that expression because there are four factors 1,10,100,1000 each WORD wide and applied to half an XMM register, a QWORD=4*WORD. I did that for clarity, but may have failed in that regard. — zx485, Feb 01 '16 at 14:25
Ya, I figured out what you were up to once I read it more closely, and understood that there really was a grouping into two qword-sized halves. I editted in a wording change which makes it sound clearer to me. It's your answer, so change it back if you like. :P Also, interesting use of `phaddd`. This is one of the very rare cases where it's actually useful for more than saving code-size and typing at the expense of speed, compared to `pshufd`/`paddd` (or `vpsrldq`) for doing a horizontal sum, since you actually care about what's where at the intermediate step. — Peter Cordes, Feb 01 '16 at 14:37
But the last two `phaddd`s should be `pshufd` / `paddd`: two 1-uop 1c latency instructions instead of one 3-uop instruction with 2c latency. (http://agner.org/optimize/). Also note that `phaddd` requires SSSE3 (added with Core2). SSE3 was mostly FP stuff. I wonder how they type if they ran out of fingers for counting at 3? SSE4.1 `PMOVZXBW` could replace the `punpckl`, but not the `punpckh`, so you still need a zeroed register, so there's little point. It'd be nice if there was a `pmovzx` with an an immediate operand to select an offset into the register to expand from. — Peter Cordes, Feb 01 '16 at 14:42
The editings you've done have a more mathematical wording than my original formulation. My approach had been focused on blocks of bits, yours on mathematical operators. However, concerning the `pshufd` / `paddd` situation: if you like to add your approach please do not replace the one given, but rather add the alternative (at the end or wherever appropriate). — zx485, Feb 01 '16 at 14:51
Anyway, now that you've got what's probably the most efficient way to do the actual atoi part, there's the tricky part of masking out bytes beyond the end of the string. `pcmpeqb` against a vector of all-zero, maybe? Then do pmovmskb / bit-scan to find the position? Or if you're parsing it from a bigger buffer, you already have the length of the string. Filling the rest of the string/vector with ASCII `'0'` one way or another, or integer zero after the subtract, would work. Maybe use the length as an index into a table of masks? — Peter Cordes, Feb 01 '16 at 14:51
@Peter Cordes: I thought about that and - putting the latency and throughput aside - would prefer an approach with `PCMPISTRM`. Unfortunately I don't have a PC available with more than SSE2. Otherwise I would have implemented it by now. Thinking these string instructions through without testing is really hard :) — zx485, Feb 01 '16 at 14:56
I wasn't going to edit your implementation. The comment is enough for me. I'll leave it up to you to speed up your implementation. `phaddd` is slower than two simpler instructions on Intel P6-family, SnB-family, Silvermont, and AMD Bulldozer-family. I didn't check the insn table for the others. Hmm, interesting point about the SSE4.2 string instructions. Also, Intel has a simulator (`sde`). You can run it on your binary, and it will trap&emulate new instruction sets. — Peter Cordes, Feb 01 '16 at 14:56
I believe you (and know) that `phaddd` is slower. I was just trying to encourage you to provide some code, because I have checked out many alternatives... ;-) btw SDE is nice, but IIRC you can't debug the code run with it :-( — zx485, Feb 01 '16 at 15:04
ah ok. Well you just use `pshufd xmm1, xmm0, 0x3 <<2 + 0x2` (or `movhlps`) to get the high two dwords into the low position of another vector, then `paddd xmm0, xmm1`. You're emulating `psrldq xmm0, 8` and then `psrldq xmm0, 4`, but with non-destructive operations. If you had AVX, you'd just use `vpsrldq xmm1, xmm0, 8`. Since you're just taking the low dword result anyway, it doesn't matter if you end up with garbage or zeros in other elements, as long as you avoid false dependencies (so movhlps isn't good in that respect, because it merges into the dest reg). — Peter Cordes, Feb 01 '16 at 15:17
I just checked on what `pcmpistrm` can do: It can check for characters that are in a range, so you can use it to mask the digit-string out of the original source, without tokenizing or copying it out of the source buffer first. It's only 3 uops for p0, but it's high latency: 9c lat. (Intel Skylake). It is the fastest of the four pcmp string insns, on Intel and AMD, though. Although to be robust, atoi needs to handle strings longer than 16B. Leading zeros are allowed, and so is just plain overflow from huge numbers. — Peter Cordes, Feb 01 '16 at 15:25
@PeterCordes AVX is allowed as another optional instruction set to support by RapidJSON. So we can #ifdef if we need to. — the_drow, Feb 01 '16 at 17:50
Awesome update. Scalar loop to skip leading zeros makes a ton of sense, nice one. Optimizing for the no leading zeros case is the right move for sure, so doing it with a loop that used SSE to check 16B at once would only be faster for the extremely rare cases of many leading zeros. BTW, style note: indent your insns one level to the right of your labels. I also like to tab the operands to the same text column, so you can look up/down to see what registers nearby instructions write without visually dealing with the ragged alignment. — Peter Cordes, Feb 03 '16 at 13:37
Put `-1` into `eax` with `mov eax, -1` instead of `xor` / `dec`. uop-cache space is more limited and precious than L1 I-cache, so optimize for fused-domain uops, not code-size. Negate with `imul eax, edx` instead of `add`/`xor`. It's 1 uop, 3c latency, instead of 2uops, 2c latency. (worse latency, but still one m-op on AMD). Latency is prob. irrelevant: While RapidJSON is parsing, it's just storing the result and parsing the next integer, not using the resulting integer for dependent calculations. — Peter Cordes, Feb 03 '16 at 13:51
You could save memory for constants by only storing half of `ddqFactor1`, and loading it into a register with `movddup`. IDK if you can generate `ddqFactor2` from it with only one or two insns. Prob. not worth it since you already touch a cacheline for other constants. Does this give the same results as `atoi` for a really long (without leading zeros) string of digits that will overflow a 32bit integer? 100% correct / consistent handling of invalid inputs may be an essential feature for a parsing library. — Peter Cordes, Feb 03 '16 at 13:55
Thanks for your suggestions. They are appreciated. I did incorporate some of them in the answer. What I did not do is using `movddup` to reduce memory usage - for a specific reason: the memory usage for constants so far is `64 BYTE = one cache line` aligned at a 64 BYTE boundary - reducing it won't be beneficial IMHO. It'd just complicate allocation/implementation without any serious benefit. — zx485, Feb 03 '16 at 18:52
You should prob. use `iaca -no_interiteration` for the throughput analysis. It's saying 16 cycles because it thinks you're asking it to analyze a loop. Also note that it doesn't count up fused-domain total ups for you. IACA's count is total unfused-domain for all execution ports. Fused-domain is what matters for the uop-cache, and for the 4-wide pipeline width (issue and retirement). Also, IACA doesn't support `imul`? wow, I knew it had some gaps (e.g. MMX), but that's pretty silly. Take IACA's predictions with a grain of salt, of course. It doesn't model the HW exactly by any means. — Peter Cordes, Feb 04 '16 at 11:39

score 5 · Answer 2 · answered Feb 01 '16 at 09:49

5

I would approach this problem like this:

Initialize the accumulator to 0.
Load the next four characters of the string into an SSE register.
Subtract the value '0' from each character.
Find the first value in the vector whose unsigned value is greater than 9.
If a value was found, shift the components of the vector to the right so that the value found in the previous step was just shifted out.
Load a vector containing powers of ten (1000, 100, 10, 1) and multiply with that.
Compute the sum of all entries in the vector.
Multiply the accumulator with an appropriate value (depending on the number of shifts in step 5) and add the vector. You can use an FMA instruction for that, but I don't know if such an instruction exists for integers.
If no value greater than 9 was found in step four, goto step 2.
Return the accumulator.

You could simplify the algorithm by zeroing out all entries beginning with the wrong one in step 5 instead of shifting and then dividing by an appropriate power of ten in the end.

Please keep in mind that this algorithm reads past the end of the string and is thus not a drop-in replacement for atoi.

answered Feb 01 '16 at 09:49

fuz

76,641
24
165
316

How do I align the string correctly so that the algorithm won't read pass the string? – the_drow Feb 01 '16 at 10:50
1

@the_drow: you can't easily - you're trying to use SIMD in an awkward way, for something that it's not really suited to. SIMD is designed for "vertical" operations rather than "horizontal" operations like this. You will need to ensure that your input string is padded out to a multiple of 16 bytes. You could copy it to a temporary padded buffer before processing though, if you really want a robust implementation (i.e. if this is not just a learning exercise). – Paul R Feb 01 '16 at 10:55
1

The only SSE or AVX integer FMAs are not useful for this: `PMADDWD`: vertical multiply of packed words then horizontally add pairs of adjacent dwords. And `PMADDUBSW`: vertical multiply of unsigned bytes in the first operand with the corresponding signed byte of the 2nd operand, then horizontally add adjacent pairs of signed words (with signed saturation). One of the AVX512 extensions has some integer FMA stuff, IIRC. – Peter Cordes Feb 01 '16 at 10:58
1

@the_drow: See this Q&A: http://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m. Your other option is to make sure your string buffer is aligned by 16, so you can't cross a page boundary by reading a full 16B. (Or even a cache-line boundary). – Peter Cordes Feb 01 '16 at 11:01
@PeterCordes Segments may end on any byte boundary. Some systems (like OpenBSD) use this for their execution-protection scheme. – fuz Feb 01 '16 at 11:05
I think copying to a local padded buffer is going to be better - that way you can pad it with 0s, which might make the processing loop simpler. – Paul R Feb 01 '16 at 11:05
@PaulR If you copy to a local buffer, it might be faster to do the conversion normally, a multiplication is only four cycles. – fuz Feb 01 '16 at 11:06
@FUZxxl: indeed - I suspect this may just be learning exercise though, so whether there is any performance benefit may not matter too much. – Paul R Feb 01 '16 at 11:07
1

@PaulR: If you write your local buffer one byte at a time while looking for the end of the string, you have to eat the latency of a store-forwarding stall. Not a throughput problem directly, though. Anyway, I think if there was a perf benefit to be had in the general case, atoi would already be implemented this way. Good point about a learning exercise, though. It's certainly a problem with easy-to-verify results, and an existing known-good implementation in libc. – Peter Cordes Feb 01 '16 at 11:08
This is both a learning exercise and an attempt at providing a robust implementation for RapidJSON. The purpose of this exercise is to be able to implement faster versions of both atoi and atof which RapidJSON requires. Even if that's not possible I'd still learn something new. – the_drow Feb 01 '16 at 11:18
@FUZxxl Can you please direct me to the instructions required in each step? I'm not familiar at all with SIMD. I just read about what can be done with it. – the_drow Feb 01 '16 at 11:19
@PeterCordes https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2 says that AVX2 supports three-operand fused multiply-accumulate support (FMA3). Isn't it in the form of a = b *c + a? – the_drow Feb 01 '16 at 11:22
It does look like I can use VFMADD231SS for steps 7 & 8. I know it doesn't support integers but floats but that could still help no? – the_drow Feb 01 '16 at 11:24
1

I suggest you stick to integer arithmetic for this and forget about FMA (and AVX for that matter) - everything you need is available in SSE, and for the typical 32 bit use case you will be processing less than 16 characters. – Paul R Feb 01 '16 at 11:54
@the_drow No. Because that's the tedious part. Also, I can't do that because you haven't even told us what compiler, what instrinsics and for what processor you want to have this answered for. If you don't mind I can write up an answer for MMIX because I'm particularly familiar with that ISA. – fuz Feb 01 '16 at 12:06
1

The question's tagged SSE, and he's been talking about Intel instructions like `VFMADD231SS`. But @the_drow: no, you don't want to convert to floating point and back just so you can save one integer instruction! Just use `pmulld` / `paddd` to mul and add vectors of 32bit integers. Although `PMADDWD` might actually work well as the first step of a horizontal sum, at least for the first chunk of digits where all the place values fit into 16bit integers. – Peter Cordes Feb 01 '16 at 12:59
1

@PeterCordes The question itself doesn't state what SIMD instructions he wants to talk about and the tag might be incorrect (as is often the case), which is why I'm asking. – fuz Feb 01 '16 at 13:01
2

@FUZxxl: It does say "I know that AXV2 supports ...". But still, I'd give you another +1 if I could for subtly criticizing the OP's lack of specificity in the question, since that doesn't explicitly say what he's targeting. It does matter which level of SSE instructions he is willing to assume, and potentially which microarch he's optimizing for. Also whether he can usefully `atoi` multiple strings in parallel. (Although in practice, the shuffling overhead to get one character at a time from 4 or 16 strings into a vector would be killer.) – Peter Cordes Feb 01 '16 at 13:05
I can't add more than 5 tags so I can't add FMA as well. I specified which architectures I'm targeting. Anything else? – the_drow Feb 01 '16 at 13:34
@PeterCordes I can run ```atoi``` multiple strings in parallel. – the_drow Feb 01 '16 at 13:39
@the_drow What SIMD instruction sets are you targetting? MMX? SSE? SSE2? Please be specific as the answer might be different depending on which instructions we can use. – fuz Feb 01 '16 at 13:53
1

RapidJSON currently implements speedups using SSE2 and SSE4.2. We can add AVX2 if needed. – the_drow Feb 01 '16 at 14:02
@the_drow What instruction set do you want to have an answer for? – fuz Feb 01 '16 at 14:03
@the_drow: So you're hoping to make something that's usable in production in RapidJSON? That would have helped a *lot* in figuring out what kind of thing you were looking for. You should have said for RapidJSON in the original question, so we'd know the context: as part of a C++ JSON library. That makes a big difference. If you can also say anything about how easy it is to make alignment guarantees for strings to be converted, that would help, too. For multiple conversions in parallel, you'd prob. need to interleave a, b, c, and d into vectors like `{a0, b0, c0, d0}`, `{a1 b1 c1 d1}`, ... – Peter Cordes Feb 01 '16 at 14:11
@PeterCordes I'll try to provide more context in the question. – the_drow Feb 01 '16 at 14:15

How to implement atoi using SIMD?

2 Answers2

Linked