Hex is an ASCII serialization format for binary. You're going to want to first convert from ASCII-decimal to binary integers in registers. Then convert that binary to hex. Hex != binary.
binary -> hex is easy; each binary byte converts separately to two ASCII hex digits. (Or each dword to 8 hex digits). See How to convert a binary integer number to a hex string? for a simple loop, and for efficient ways using SSE2, SSSE3, AVX2, AVX512F, or AVX512VBMI to convert 64 bits of input at a time into 16 bytes of hex, or with AVX2 even do your whole 128-bit / 16-byte input in one step and produce all 32 bytes of hex digits.
That just leaves the decimal-ASCII -> `unsigned __int128` input problem. 128-bit shift with `shld`/.../`shl` (starting with the high dword) and add with `add`/`adc`/`adc`/`adc` (starting with the low dword) are straightforward, so you can implement the usual `total = total * 10 + digit` (NASM Assembly convert input to integer?) but with extended-precision 128-bit integer math. It takes 4x 32-bit registers to hold a 128-bit integer.
Implement `t*10` as `t*2 + t*8 = (t*2) + (t*2)*4` by first doubling using either 3x `shld` and `add eax,eax`, or `add eax,eax` + 3x `adc same,same`. Then copy and shift by another 2, then add the two 128-bit numbers together.
But with only 7 GP integer registers (not counting the stack pointer), you'd have to spill something to memory. And you also want your string input pointer in a register.
So probably you'd want to left-shift by 1 in your 4 registers, then spill them to memory and shift by another 2 in registers. Then `add`/3x `adc` from the stack buffer where you spilled them. That lets you multiply a 128-bit integer in 4 regs by 10 without using any extra registers.
```nasm
; input: total = 128-bit integer in EBX:ECX:EDX:EAX
; 16-byte tmp buffer at [esp]
; result: total *= 10 in-place
; clobbers: none
; it's traditional to keep a 64-bit integer in EDX:EAX, e.g. for div or from mul
; I chose EBX:ECX for the high half so it makes an easy-to-remember pattern.

;;; total *= 2 and copy to tmp buf
    add  eax, eax          ; start from the low element for carry propagation
    mov  [esp + 0], eax
    adc  edx, edx
    mov  [esp + 4], edx
    adc  ecx, ecx
    mov  [esp + 8], ecx
    adc  ebx, ebx
    mov  [esp + 12], ebx

;;; shift that result another 2 to get total * 8
    shld ebx, ecx, 2       ; start from the high element to pull in unmodified lower bits
    shld ecx, edx, 2
    shld edx, eax, 2
    shl  eax, 2

;;; add total*2 from memory to total*8 in regs to get total*10
    add  eax, [esp + 0]
    adc  edx, [esp + 4]
    adc  ecx, [esp + 8]
    adc  ebx, [esp + 12]
```
Out-of-order execution is very helpful here. Notice that in the `shld` block, the instructions don't actually depend on the previous `shld`: they pull in bits from unmodified lower elements. As soon as the first `add eax,eax` runs, `shl eax,2` can run (if the front-end has already issued it).

Register renaming makes it possible to run that `shl` without stalling for a WAR (write-after-read) hazard. The `shld edx, eax, 2` also needs EAX as an input, but the whole point of register renaming is to let the CPU track that version of EAX separately from the output of the `shl eax,2`.
This lets us write code that doesn't use many architectural registers (just these 4), but still takes advantage of more physical registers to let the `shld`/`shl` block execute in the opposite order from program order, as inputs become ready from the `add`/`adc` block.

This is great because the final `add`/`adc` block (adding from memory) has its inputs ready in the order it needs them, without serializing the latencies of either chain of instructions. That matters because `shld` has 3-cycle latency on current Intel CPUs (like Haswell/Skylake), up from 1 on Sandybridge/IvyBridge. (It was a 2-uop instruction with 2c latency on Nehalem and earlier.) But on Haswell/Skylake it's still 1 uop with 1-per-clock throughput (port 1 only).

Ryzen has slower `shld`: 6 uops, 3-cycle latency, one-per-3-cycle throughput. (https://agner.org/optimize/)
We can effectively have 3 add or shift dependency chains in flight at once, even though in program order each block is done separately. And once we add in the new digit with a 4th block, it can be in flight, too.
Example loop. Enter it with EBX:ECX:EDX = 0 and EAX = the first digit, ready to check the 2nd character for being a digit and then do `total = t*10 + digit`.
```nasm
.digit_loop:
    ... earlier block       ; total *= 10
    add  eax, ebp           ; total += digit
    adc  edx, 0
    adc  ecx, 0
    adc  ebx, 0
.loop_entry_point:
    inc  esi
    movzx ebp, byte [esi]   ; load a new input digit
    sub  ebp, '0'           ; ASCII digit -> 0..9 integer
    cmp  ebp, 9             ; unless it was out of range
    jbe  .digit_loop
;else fall through on a non-digit.
; ESI points at the first non-digit
; EBX:ECX:EDX:EAX holds the 128-bit binary integer.
```
You could move the `total += digit` up to before the reload of `total*2` to better hide store-forwarding latency.
Another possible option is 4x `mul` and the requisite `add`/`adc` of the partial products. That might be nice if you can assume BMI2 for `mulx`, which multiplies without affecting FLAGS so you can interleave `mulx` with `adc`. But then you'd need `10` in a register.
Another option is to use XMM registers for SSE2 64-bit integer math. Or MMX for 64-bit MMX regs. Dealing with 64-bit-element boundaries is inconvenient, though, because only scalar integer has add-with-carry. But possibly still worth it because you only have half the number of operations.
It might be better to convert 9-digit groups of decimal digits to 32-bit binary integers, then do extended-precision multiplies by 1e9 to combine. (Like the last 9 digits, the 9 digits before that, etc.) So you don't have all this `adc` / store+reload work for every digit. That would mean a significant amount of multiplying at the end to combine up to five groups of digits.
Or maybe just process the first 9 digits with a single register (the normal way), then widen to two registers with a 2nd loop, then widen to four for digits after the 18th. That would be good for numbers that turn out to be shorter than 9 digits, only ever using the fast 1-register accumulator.