
I'm hoping this is going to be my final question on here relating to this subject!

I'm looking for a way to convert huge decimal numbers that are encoded as ASCII into their 128-bit binary representation (shown here as hexadecimal).

These are actually IPv6 addresses presented in their decimal notation.

For example, "55844105986793442773355413541572575232" resolves to 0x2a032f00000000000000000000000000.

The majority of my code is in x86-32 MASM assembly, so I'd rather keep it that way than chop and change between different languages.

I've got code that works in Python, but as above, I'd like to have everything in x86 asm.

Peter Cordes
colinr
  • binary -> hex is easy; each byte converts separately to two hex digits. (Or each dword to 8 hex digits). See [How to convert a number to hex?](//stackoverflow.com/q/53823756) for a simple loop, and for efficient ways using SSE2, SSSE3, AVX2, AVX512F, or AVX512VBMI to convert 64 bits of input at a time into 16 bytes of hex, or with AVX2 even do your whole 128-bit input in one step. – Peter Cordes Jul 21 '19 at 13:19
  • Are you writing entirely assembler code or are you calling the assembly code from a higher level language like C/C++/Python? – Michael Petch Jul 21 '19 at 16:30
  • @MichaelPetch, I'm entirely using assembly (WinASM Studio), or at least trying to do so! My proof of concept was in Python – colinr Jul 21 '19 at 17:41
  • I work in incident response and I've written a tool to parse Office 365 audit logs (done and working). In addition to that, I'm now building a data set to cross-reference IP addresses against countries and proxies. IPv4 was easy; it's just the IPv6 addresses that are causing me a headache, so much so that after reading the other comments and code on here I had to have a little sleep! – colinr Jul 21 '19 at 17:49
  • IDK why you'd have an IPv6 address as a single decimal integer in the first place, or why you'd be parsing it with asm. Especially 32-bit mode asm. If you're developing new tools, why not have them run in 64-bit mode where you only need 2 registers, and SSE2 support is guaranteed? And you can usually get C compiler (like gcc) to work with 128-bit integers. – Peter Cordes Jul 21 '19 at 20:38
  • Because what started out as IPv4-only turned into needing support for IPv6 later. Taking into consideration that casting in C++ is a PITA, assembly seemed at the time the way to go. Obviously, I'm now having doubts. – colinr Jul 21 '19 at 21:19

2 Answers


This is 2 parts - converting "decimal ASCII" to 128-bit unsigned integer; then converting 128-bit unsigned integer to "hex ASCII".

The first part is like:

set result to zero
for each character:
    if character not valid handle "invalid character error" somehow
    else
        if result is larger than "max/10" handle "overflow error" somehow
        result = result * 10
        digit = character - '0'
        if result is larger than "max - digit" handle "overflow error" somehow
        result = result + digit

For this you'll need code to multiply a 128-bit integer by 10, compare two 128-bit integers, subtract a byte from a 128-bit integer, and add a byte to a 128-bit integer. Multiplying by 10 can be (and should be) implemented as "x = (x << 3) + (x << 1)", so it reduces to left shifts and an addition.
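As a cross-check for the assembly, the pseudocode above can be sketched in Python, which has arbitrary-precision integers built in. The names `MAX_U128` and `parse_u128` are made up for this illustration, and raising exceptions is just one assumed way of handling the two error cases.

```python
MAX_U128 = (1 << 128) - 1  # largest value that fits in 128 bits


def parse_u128(text):
    """Convert a decimal ASCII string to an integer, rejecting bad input."""
    if not text:
        raise ValueError("empty string")
    result = 0
    for ch in text:
        if not ("0" <= ch <= "9"):
            raise ValueError(f"invalid character {ch!r}")
        if result > MAX_U128 // 10:
            raise OverflowError("value does not fit in 128 bits")
        result = result * 10  # in asm: (x << 3) + (x << 1)
        digit = ord(ch) - ord("0")
        if result > MAX_U128 - digit:
            raise OverflowError("value does not fit in 128 bits")
        result = result + digit
    return result


print(hex(parse_u128("55844105986793442773355413541572575232")))
# 0x2a032f00000000000000000000000000
```

Both overflow checks happen before the operation they guard, which is what lets the assembly version stay within 128 bits at every step.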

Note: I'll assume 32-bit 80x86 (based on your previous questions). I'll also use NASM syntax (because I'm "less familiar" with MASM syntax) but it should be easy enough to convert to MASM syntax

Left shift; you'd split the 128-bit integer into 4 (32 bit) pieces and use something like:

    ;esi = address of source number
    ;edi = address of destination number
    ;cl = shift count

    mov edx,[esi+12]
    mov eax,[esi+8]
    shld edx,eax,cl
    mov [edi+12],edx
    mov edx,eax
    mov eax,[esi+4]
    shld edx,eax,cl
    mov [edi+8],edx
    mov edx,eax
    mov eax,[esi]
    shld edx,eax,cl
    mov [edi+4],edx
    shl eax,cl
    mov [edi],eax
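
To check the shift routine's results, here is a small Python model of the same idea: four 32-bit limbs stored lowest first (matching offsets 0, 4, 8, 12), shifted from the highest limb down. The helper names `to_limbs`, `from_limbs` and `shl128` are invented for this sketch.

```python
MASK32 = 0xFFFFFFFF


def to_limbs(value):
    """Split a 128-bit integer into four 32-bit limbs, lowest first."""
    return [(value >> (32 * i)) & MASK32 for i in range(4)]


def from_limbs(limbs):
    """Rebuild the integer from four 32-bit limbs, lowest first."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def shl128(limbs, count):
    """Shift left by 0..31 bits, working from the highest limb down."""
    if not 0 <= count <= 31:
        raise ValueError("count must be 0..31")
    out = [0, 0, 0, 0]
    for i in (3, 2, 1):
        # like shld: shift this limb and pull in the top bits of the limb below
        out[i] = ((limbs[i] << count) | (limbs[i - 1] >> (32 - count))) & MASK32
    out[0] = (limbs[0] << count) & MASK32  # like shl: zeros shift in
    return out
```

Bits shifted out of the top limb are discarded, just as in the assembly.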

For addition of two 128-bit numbers:

    ;esi = address of first source number
    ;edi = address of second source number and destination

    mov eax,[esi]
    add [edi],eax
    mov eax,[esi+4]
    adc [edi+4],eax
    mov eax,[esi+8]
    adc [edi+8],eax
    mov eax,[esi+12]
    adc [edi+12],eax
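
The add/adc chain can be modelled the same way in Python. This is only a sketch for checking results; `add128` is a hypothetical name, and limbs are again stored lowest first.

```python
MASK32 = 0xFFFFFFFF


def add128(first, second):
    """Add two 4-limb numbers (lowest limb first); return (sum limbs, carry out)."""
    out = []
    carry = 0
    for a, b in zip(first, second):
        total = a + b + carry       # add for the first limb, adc for the rest
        out.append(total & MASK32)  # keep the low 32 bits
        carry = total >> 32         # 0 or 1, feeds the next limb
    return out, carry
```

A non-zero carry out of the top limb means the true sum did not fit in 128 bits.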

For addition of a dword (zero extended byte) to a 128-bit number:

    ;eax = first number
    ;edi = address of second number and destination

    add [edi],eax
    adc dword [edi+4],0
    adc dword [edi+8],0
    adc dword [edi+12],0

For subtraction of a dword (zero extended byte) from a 128-bit number:

    ;eax = first number
    ;edi = address of second number and destination

    sub [edi],eax
    sbb dword [edi+4],0
    sbb dword [edi+8],0
    sbb dword [edi+12],0
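
A Python model of the sub/sbb chain, under the same assumptions as before (four 32-bit limbs, lowest first; `sub_small` is a made-up name). It is the operation needed to form "max - digit" for the second overflow check.

```python
MASK32 = 0xFFFFFFFF


def sub_small(limbs, value):
    """Subtract a small value (0..2**32-1) from a 4-limb number.

    Returns (difference limbs, borrow out).
    """
    out = []
    borrow = value  # the first limb loses the value itself (sub)
    for limb in limbs:
        diff = limb - borrow
        out.append(diff & MASK32)
        borrow = 1 if diff < 0 else 0  # later limbs only lose the borrow (sbb ..., 0)
    return out, borrow
```

A borrow out of the top limb means the value was larger than the 128-bit number.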

For comparing 128 bit integers:

    ;esi = address of first source number
    ;edi = address of second source number

    mov eax,[esi+12]
    cmp eax,[edi+12]     ;Compare first with second, highest dword first
    jb .smaller
    ja .larger
    mov eax,[esi+8]
    cmp eax,[edi+8]
    jb .smaller
    ja .larger
    mov eax,[esi+4]
    cmp eax,[edi+4]
    jb .smaller
    ja .larger
    mov eax,[esi]
    cmp eax,[edi]
    jb .smaller
    ja .larger
    mov al,0         ;Values are equal
    ret

.smaller:
    mov al,-1        ;First value is smaller than second
    ret

.larger:
    mov al,1         ;First value is larger than second
    ret
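
The comparison can be sketched in Python as well, using the return convention stated in the comments above (-1 if the first value is smaller, 1 if it is larger, 0 if equal). `cmp128` is a hypothetical name and limbs are stored lowest first.

```python
def cmp128(first, second):
    """Compare two 4-limb unsigned numbers, most significant limb first."""
    for i in (3, 2, 1, 0):
        if first[i] < second[i]:
            return -1  # first value is smaller than second
        if first[i] > second[i]:
            return 1   # first value is larger than second
    return 0           # all limbs equal
```

The first limb that differs, scanning from the top, decides the result; lower limbs are never examined after that.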

The second part (converting to hex ASCII) is fairly trivial - mostly just a "for each byte from highest to lowest; convert byte to 2 hex characters (possibly using a lookup table)" thing. You should be able to find code to do this easily, so I won't describe it here.
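
For reference, a minimal Python sketch of that byte-by-byte conversion with a lookup table. `HEX_DIGITS` and `limbs_to_hex` are names chosen for this example, and the input is assumed to be four 32-bit limbs stored lowest first.

```python
HEX_DIGITS = "0123456789abcdef"  # lookup table: nibble value -> ASCII character


def limbs_to_hex(limbs):
    """Convert four 32-bit limbs (lowest first) to 32 hex characters."""
    chars = []
    for limb in reversed(limbs):          # highest limb first
        for shift in (24, 16, 8, 0):      # highest byte of the limb first
            byte = (limb >> shift) & 0xFF
            chars.append(HEX_DIGITS[byte >> 4])    # high nibble
            chars.append(HEX_DIGITS[byte & 0x0F])  # low nibble
    return "".join(chars)


print(limbs_to_hex([0, 0, 0, 0x2A032F00]))
# 2a032f00000000000000000000000000
```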

Brendan
  • `shld` by `cl` is much less efficient on current Intel than `shld r,r, 1` or `3`. Like 4 uops instead of 1. (Even so, `shld` immediate still has 3-cycle latency and only 1-per-clock throughput, unfortunately.) So it's tempting to do the shift-by-1 with an `add`/3x`adc` chain, especially if you don't care about Intel before Haswell. – Peter Cordes Jul 21 '19 at 13:44
  • @PeterCordes: The goal (especially for beginners) is the ability to understand and maintain the code, not performance. – Brendan Jul 21 '19 at 13:47
  • You're spending a lot of registers on pointers. I'd address tmp buffers relative to ESP to leave at least 4 registers for holding my 128-bit total. Spilling during calculation of `total * 10` seems unavoidable unless maybe we use `mulx` or `mul`. But you only need to spill 1 copy of it, e.g. shift by 1 in 4x registers, then spill that and shift your registers by another 2, then add / adc from memory. Leaves you room to keep an input string pointer in another register. – Peter Cordes Jul 21 '19 at 13:50
  • I added code examples to my answer. I think spilling `total*2` leads to very readable and understandable code, as well as being efficient. – Peter Cordes Jul 21 '19 at 14:30

Hex is an ASCII serialization format for binary. You're going to want to first convert from ASCII-decimal to binary integers in registers. Then convert that binary to hex. Hex != binary.


binary -> hex is easy; each binary byte converts separately to two ASCII hex digits. (Or each dword to 8 hex digits). See How to convert a binary integer number to a hex string? for a simple loop, and for efficient ways using SSE2, SSSE3, AVX2, AVX512F, or AVX512VBMI to convert 64 bits of input at a time into 16 bytes of hex, or with AVX2 even do your whole 128-bit / 16-byte input in one step and produce all 32 bytes of hex digits.


That just leaves the decimal-ASCII -> unsigned __int128 input problem. 128-bit shift with shld/.../shl (starting with the high dword) and add with add/adc/adc/adc (starting with the low dword) are straightforward, so you can implement the usual total = total * 10 + digit (see "NASM Assembly convert input to integer?") but with extended-precision 128-bit integer math. It takes 4x 32-bit registers to hold a 128-bit integer.

Implement t*10 as t*2 + t*8 = (t*2) + (t*2)*4 by first doubling using either 3x shld and add eax,eax, or add eax,eax + 3x adc same,same. Then copy and shift by another 2, then add the two 128-bit numbers together.

But with only 7 GP integer registers (not counting the stack pointer), you'd have to spill something to memory. And you also want your string input pointer in a register.

So probably you'd want to left-shift by 1 in your 4x registers, then spill them to memory and shift by another 2 in registers. Then add/3xadc from the stack buffer where you spilled them. That lets you multiply a 128-bit integer in 4 regs by 10 without using any extra registers.

    ; input:  total = 128-bit integer in  EBX:ECX:EDX:EAX
    ;         16-byte tmp buffer at [esp]
    ; result: total *= 10  in-place
    ; clobbers: flags and the tmp buffer; no other registers

    ; it's traditional to keep a 64-bit integer in EDX:EAX, e.g. for div or from mul
    ; I chose EBX:ECX for the high half so it makes an easy-to-remember pattern.

;;; total *= 2  and copy to tmp buf
    add   eax, eax             ; start from the low element for carry propagation
    mov   [esp + 0], eax
    adc   edx, edx
    mov   [esp + 4], edx
    adc   ecx, ecx
    mov   [esp + 8], ecx
    adc   ebx, ebx
    mov   [esp + 12], ebx

;;; shift that result another 2 to get   total * 8
    shld  ebx, ecx, 2        ; start from the high element to pull in unmodified lower bits
    shld  ecx, edx, 2
    shld  edx, eax, 2
    shl   eax, 2

;;; add total*2 from memory to total*8 in regs to get  total*10
    add   eax, [esp + 0]
    adc   edx, [esp + 4]
    adc   ecx, [esp + 8]
    adc   ebx, [esp + 12]

Out-of-order execution is very helpful here. Notice that in the shld block, the instructions don't actually depend on the previous shld. They pull in bits from unmodified lower elements. As soon as the first add eax,eax runs, shl eax,2 can run (if the front-end has already issued it).

Register renaming makes it possible to run that SHL without stalling for a WAR (Write-after-read) hazard. The shld edx, eax, 2 also needs EAX as an input, but the whole point of register renaming is to let the CPU track that version of EAX separately from the output of the shl eax,2.

This lets us write code that doesn't use many architectural registers (just these 4), but still takes advantage of more physical registers to let the shld/shl block execute in the opposite order from program order, as inputs become ready from the add/adc block.

This is great because it means that the final add/adc block (adding from memory) has its inputs ready in the order it needs them, without serializing the latencies of either chain of instructions. That matters because shld has 3-cycle latency on current Intel CPUs (like Haswell/Skylake), up from 1 on Sandybridge/IvyBridge. (It was a 2-uop instruction with 2c latency on Nehalem and earlier.) But on Haswell/Skylake it's still 1 uop with 1-per-clock throughput (port 1 only).

Ryzen has slower shld: 6 uops, 3 cycle latency, one-per-3-cycle throughput. (https://agner.org/optimize/)

We effectively can have 3 add or shift chains in flight at once even though in program order each block is done separately. And once we add in the new digit with a 4th block, it can be in flight, too.
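To check the arithmetic of the multiply-by-10 block above separately from the scheduling discussion, here is a Python model of its three steps. The function and helper names are invented for this sketch; limbs are listed lowest first, so index 0 plays the role of EAX and index 3 the role of EBX.

```python
MASK32 = 0xFFFFFFFF


def to_limbs(value):
    """Split a 128-bit integer into four 32-bit limbs, lowest first."""
    return [(value >> (32 * i)) & MASK32 for i in range(4)]


def from_limbs(limbs):
    """Rebuild the integer from four 32-bit limbs, lowest first."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def times10(limbs):
    """Return limbs * 10 (mod 2**128), computed as total*2 + total*8."""
    # Step 1: total *= 2 with an add/adc chain; keep a copy (the spill to [esp]).
    doubled, carry = [], 0
    for limb in limbs:
        total = limb + limb + carry
        doubled.append(total & MASK32)
        carry = total >> 32
    # Step 2: shift the doubled value left by 2 more bits to get total*8.
    times8 = [0, 0, 0, 0]
    for i in (3, 2, 1):
        times8[i] = ((doubled[i] << 2) | (doubled[i - 1] >> 30)) & MASK32
    times8[0] = (doubled[0] << 2) & MASK32
    # Step 3: add the spilled total*2 back in with another add/adc chain.
    result, carry = [], 0
    for a, b in zip(times8, doubled):
        total = a + b + carry
        result.append(total & MASK32)
        carry = total >> 32
    return result
```

Like the assembly, this sketch silently wraps around if the product needs more than 128 bits.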

Example loop. Enter it at .loop_entry_point with EBX:ECX:EDX = 0, EAX = the first digit (already converted to 0..9) and ESI pointing at that first character, ready to check the 2nd character for being a digit and then do total = t*10 + digit.

.digit_loop:
    ; ... earlier block goes here: total *= 10

    add    eax, ebp      ; total += digit
    adc    edx, 0
    adc    ecx, 0
    adc    ebx, 0

.loop_entry_point:
    inc    esi
    movzx  ebp, byte ptr [esi]    ; load a new input digit

    sub    ebp, '0'               ; ASCII digit -> 0..9 integer
    cmp    ebp, 9                 ; unless it was out of range
    jbe   .digit_loop
;else fall through on a non-digit.

; ESI points at the first non-digit
; EBX:ECX:EDX:EAX holds the 128-bit binary integer.

You could move the total += digit up to before the reload of total*2 to better hide store-forwarding latency.
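
The loop above can be modelled in Python to produce reference results for testing. This is a sketch with an invented name (`parse_until_non_digit`); unlike the assembly it also accepts a string whose first character is not a digit, and it masks to 128 bits to mimic the silent wraparound of the unchecked registers.

```python
MASK128 = (1 << 128) - 1


def parse_until_non_digit(text):
    """Return (value mod 2**128, index of the first non-digit character)."""
    total = 0
    index = 0
    while index < len(text):
        digit = ord(text[index]) - ord("0")  # ASCII digit -> 0..9
        if not 0 <= digit <= 9:              # the asm does this with sub / cmp / jbe
            break
        total = (total * 10 + digit) & MASK128  # no overflow detection
        index += 1
    return total, index


value, end = parse_until_non_digit("55844105986793442773355413541572575232\n")
print(hex(value), end)
```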


Another possible option is 4x mul and the requisite add/adc of the partial products. That might be nice if you can assume BMI2 for mulx to multiply without affecting flags so you can interleave mulx with adc. But then you'd need 10 in a register.

Another option is to use XMM registers for SSE2 64-bit integer math, or MMX with its 64-bit registers. Dealing with 64-bit-element boundaries is inconvenient, though, because only scalar integer has add-with-carry. But possibly still worth it because you only have half the number of operations.


It might be better to convert 9-digit groups of decimal digits to 32-bit binary integers, then do extended-precision multiplies by 1e9 to combine them. (Like the last 9 digits, the 9 digits before that, etc.) So you don't have all this adc / store+reload work for every digit. That would mean a significant amount of multiplying at the end to combine up to five groups of digits (a 128-bit value has at most 39 decimal digits).
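
A rough Python sketch of that grouping idea, to show the arithmetic only. `group_value` and `parse_grouped` are hypothetical names; in assembly each group would be accumulated in a single 32-bit register, and only the multiply by 1e9 and the add would need extended precision.

```python
GROUP_DIGITS = 9
GROUP_SCALE = 10 ** GROUP_DIGITS  # 1e9, and 999999999 still fits in 32 bits


def group_value(chunk):
    """Convert up to 9 ASCII digits with an ordinary single-register loop."""
    value = 0
    for ch in chunk:
        value = value * 10 + (ord(ch) - ord("0"))
    return value


def parse_grouped(text):
    """Parse a decimal string by combining 9-digit groups."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    # The leading group takes the leftover digits so the rest are exactly 9 long.
    head = len(text) % GROUP_DIGITS or GROUP_DIGITS
    total = group_value(text[:head])
    for start in range(head, len(text), GROUP_DIGITS):
        chunk = text[start:start + GROUP_DIGITS]
        total = total * GROUP_SCALE + group_value(chunk)  # wide multiply + add
    return total
```

This version does one wide multiply per group instead of one per digit; it does no overflow checking of its own.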

Or maybe just process the first 9 digits with a single register (the normal way), then widen to two registers with a 2nd loop, then widen to four for digits after the 18th. That would be good for numbers that turn out to be shorter than 9 digits, only ever using the fast 1-register accumulator.

Peter Cordes
  • Thanks chaps. I'm looking at the code but I can't seem to figure out where the conversion from ASCII decimal to 128-bit raw binary (hex) takes place. It's probably worth mentioning that the source string can be of variable length, but when converting to hex, the resulting hex number will not exceed 128 bits. The result won't be saved as ASCII, so there is no need for further conversion back to ASCII; as Brandon said, that is trivial anyway. – colinr Jul 21 '19 at 15:27
  • Brendan, not Brandon, Sorry! – colinr Jul 21 '19 at 15:34
  • @colinr : Hex is a serialization format for binary integers. You said you wanted hex. But anyway, the loop in my code ends on reaching a non-digit, with the 128-bit binary integer in registers. It doesn't attempt overflow-detection in the `t = t*10 + digit` loop, so it's good that the number can't exceed 128-bit. I updated my answer to comment the end of the loop. If you did want to turn that into hex, you'd do that after. If you didn't understand the basics of the string -> integer loop, read the linked question that explains the basics for the simple 32-bit case. – Peter Cordes Jul 21 '19 at 15:46
  • "shrld" is a mistake. – ecm May 16 '21 at 17:31
  • @ecm: thanks, fixed. – Peter Cordes May 16 '21 at 17:36