
I'm trying to encode a binary file into Base64. However, I'm stuck at a few steps, and I'm not sure this is the right way to think about it; see the comments in the code below:

SECTION .bss            ; Section containing uninitialized data

    BUFFLEN equ 6       ; We read the file 6 bytes at a time
    Buff:   resb BUFFLEN    ; Text buffer itself

SECTION .data           ; Section containing initialised data

    B64Str: db "000000"
    B64LEN equ $-B64Str

    Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

SECTION .text           ; Section containing code

global  _start          ; Linker needs this to find the entry point!

_start: 
    nop         ; This no-op keeps gdb happy...

; Read a buffer full of text from stdin:
Read:
    mov eax,3       ; Specify sys_read call
    mov ebx,0       ; Specify File Descriptor 0: Standard Input
    mov ecx,Buff        ; Pass offset of the buffer to read to
    mov edx,BUFFLEN     ; Pass number of bytes to read at one pass
    int 80h         ; Call sys_read to fill the buffer
    mov ebp,eax     ; Save # of bytes read from file for later
    cmp eax,0       ; If eax=0, sys_read reached EOF on stdin
    je Done         ; Jump If Equal (to 0, from compare)

; Set up the registers for the process buffer step:
    mov esi,Buff        ; Place address of file buffer into esi
    mov edi,B64Str      ; Place address of line string into edi
    xor ecx,ecx     ; Clear line string pointer to 0


;;;;;;
; GET 6 bits from input
;;;;;;


;;;;;;
; Convert to B64 char
;;;;;;

;;;;;;
; Print the char
;;;;;;

;;;;;;
; Proceed to the next 6 bits
;;;;;;


; All done! Let's end this party:
Done:
    mov eax,1       ; Code for Exit Syscall
    mov ebx,0       ; Return a code of zero 
    int 80H         ; Make kernel call

In words, it should do the following:

1) Hex value:

7C AA 78

2) Binary value:

0111 1100 1010 1010 0111 1000

3) Groups of 6 bits:

011111 001010 101001 111000

4) Convert to numbers:

31 10 41 56

5) Each number maps to a letter, digit or symbol:

31 = f
10 = K
41 = p
56 = 4

So the final output is: fKp4

So my questions are: how do I get the 6 bits, and how do I convert those bits into a character?

Some programmer dude
  • AND with `0x3F` to get the low 6 bits as an integer you can use as an array index. `shr` by 6 to get the next 6 bits down to the bottom. – Peter Cordes Dec 11 '17 at 14:44
  • It's not clear how do you want to read binary input from stdin, and also how do you want to loop till `` and remember to pad your output correctly. Although that's not your question, and not really important. – Ped7g Dec 11 '17 at 14:45
  • And BTW, if the input is larger than one register, you'll need to use an extended-precision shift. I guess work in chunks for `lcm(32, 6)` bits so you don't need to do an extended-precision shift of the entire bitstring. – Peter Cordes Dec 11 '17 at 14:51
  • @Ped7g: Reading binary from stdin is trivial: `./encodeb64 < binary_file`. Let the shell take care of making the `open()` system call for you. – Peter Cordes Dec 11 '17 at 14:51
  • And if this is for 64b linux, then you shouldn't use `int 0x80`, but `syscall` (the services have different arguments, so you must rewrite each call a bit). The code in my answer will stay for 64b almost intact, only the memory access should be rather `[rsi], [rdi]` and the pointer advancing `add`s should use of course `rsi/rdi` too. And the `[Base64+...]` lines should use rather `r..` variant (the value will be zero-extended to 64b in each case). Makes me wonder how you did end with `int 0x80` in the first place, if you have some 32b tutorial, then rather switch to building 32b binary. – Ped7g Dec 11 '17 at 15:38

1 Answer

EDIT after a few years:

Recently somebody ran into this example, and while discussing how it works and how to convert it to x64 for 64-bit Linux, I turned it into a fully working example. The source is available here: https://gist.github.com/ped7g/c96a7eec86f9b090d0f33ba36af056c1


There are two main ways to implement it: either a generic loop capable of picking any 6 bits, or fixed code that handles 24 bits (3 bytes) of input at a time. The fixed variant produces exactly 4 Base64 characters and ends on a byte boundary, so you can read the next 24 bits from offset +3.
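A minimal sketch of the first, generic variant, assuming 32-bit NASM and the Base64 table from the question. The routine name encode_generic and its register contract are made up for illustration. It keeps leftover bits in an accumulator and emits one character whenever at least 6 bits are available. Treat it as a sketch, not debugged code; it does not add '=' padding.

```nasm
SECTION .data
    Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

SECTION .text
; Hypothetical contract (an assumption, not from the answer):
;   in:  esi = input pointer, ebp = input length in bytes (may be 0)
;        edi = output pointer
;   out: esi/edi advanced past the consumed input / produced output
;   clobbers eax, ecx, edx, ebp
encode_generic:
    xor   eax,eax               ; eax = bit accumulator, newest bits at the bottom
    xor   ecx,ecx               ; ecx = number of not yet emitted bits in eax
    test  ebp,ebp
    jz    .done                 ; empty input: nothing to emit
.next_byte:
    shl   eax,8
    mov   al,[esi]              ; append 8 new bits below the leftover ones
    inc   esi
    add   ecx,8                 ; now 8, 10 or 12 bits are pending
.emit:
    sub   ecx,6                 ; consume the 6 oldest pending bits
    mov   edx,eax
    shr   edx,cl                ; move them down to bits 0..5
    and   edx,0x3F
    mov   dl,[Base64+edx]       ; convert 0-63 value into a B64 character
    mov   [edi],dl
    inc   edi
    cmp   ecx,6
    jae   .emit                 ; a second full group is pending (12-bit case)
    dec   ebp
    jnz   .next_byte
    test  ecx,ecx
    jz    .done                 ; input length was a multiple of 3
    ; 2 or 4 bits are left: pad with zero bits on the right to a full group
    mov   edx,eax
    neg   ecx
    add   ecx,6                 ; ecx = 6 - leftover bits
    shl   edx,cl
    and   edx,0x3F
    mov   dl,[Base64+edx]
    mov   [edi],dl
    inc   edi
.done:
    ret
```

This variant needs no zero padding after the input, because it never reads more than one byte at a time, but it does more work per character. The rest of this answer uses the fixed 24-bit variant.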

For the fixed variant, assume esi points to the source binary data, padded with enough zero bytes that reading slightly past the end of the input buffer is safe (+3 bytes in the worst case).

And edi points to an output buffer of at least ((input_length+2)/3*4) bytes (integer division). The formula rounds up to whole 4-character groups, so it already covers the '=' padding that B64 requires at the end.

; convert 3 bytes of input into four B64 characters of output
mov   eax,[esi]  ; read 3 bytes of input
      ; (actually reads 4 bytes; the extra one is ignored)
add   esi,3      ; advance pointer to next input chunk
bswap eax        ; first input byte as MSB of eax
shr   eax,8      ; throw away the 1 junk byte (LSB after bswap)
; produce 4 base64 characters backward (last group of 6b is converted first)
; (this keeps the 6-bit group extraction simple: "shr eax,6" + "and edx,0x3F")
mov   edx,eax    ; get copy of last 6 bits
shr   eax,6      ; throw away 6bits being processed already
and   edx,0x3F   ; keep only last 6 bits
mov   bh,[Base64+edx]  ; convert 0-63 value into B64 character (4th)
mov   edx,eax    ; get copy of next 6 bits
shr   eax,6      ; throw away 6bits being processed already
and   edx,0x3F   ; keep only last 6 bits
mov   bl,[Base64+edx]  ; convert 0-63 value into B64 character (3rd)
shl   ebx,16     ; make room in ebx for the next two characters (4th+3rd now in the upper 16 bits)
mov   edx,eax    ; get copy of next 6 bits
shr   eax,6      ; throw away 6bits being processed already
and   edx,0x3F   ; keep only last 6 bits
mov   bh,[Base64+edx]  ; convert 0-63 value into B64 character (2nd)
; here eax contains exactly only 6 bits (zero extended to 32b)
mov   bl,[Base64+eax]  ; convert 0-63 value into B64 character (1st)
mov   [edi],ebx  ; store four B64 characters as output
add   edi,4      ; advance output pointer

After the last group of 3 input bytes you must overwrite the end of the output with the proper number of '=' characters, to replace the characters produced from the fake zero bytes. That is: 1 byte of input (8 bits, 2 B64 characters) => output ends with '=='; 2 bytes of input (16 bits, 3 B64 characters) => output ends with '='; 3 bytes of input => all 24 bits used => 4 valid B64 characters.
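As a sketch of that fix-up, assume ecx holds the number of real input bytes in the final group (1, 2 or 3) and edi has already been advanced past the final four characters. The label fix_padding and this register contract are hypothetical, not part of the code above:

```nasm
; Hypothetical contract (an assumption):
;   ecx = number of real input bytes in the final 3-byte group (1, 2 or 3)
;   edi = output pointer, already advanced past the final 4 characters
fix_padding:
    cmp   ecx,3
    jae   .done                 ; full group: all four characters are valid
    mov   byte [edi-1],'='      ; 1 or 2 input bytes: the 4th character is padding
    cmp   ecx,2
    jae   .done                 ; 2 input bytes: a single '=' is enough
    mov   byte [edi-2],'='      ; 1 input byte: the 3rd character is padding too
.done:
    ret
```

For example, the single input byte 7C is first encoded as "fAAA" (with two fake zero bytes) and then patched to "fA==".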

If you don't want to read the whole file into memory and build the whole output in memory, you can use buffers of limited size, for example 900 B of input -> 1200 B of output, and process the input in 900 B blocks. Or you can use a 3 B -> 4 B in/out buffer and remove the pointer advancing completely (or even drop esi/edi and use fixed memory addresses), since you will have to load/store the input/output separately on every iteration anyway.
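The 900 B -> 1200 B variant could look roughly like this for 32-bit Linux with int 80h. The names INLEN, OUTLEN, InBuf, OutBuf and the PUT6 macro are invented for this sketch, and it stores each character with its own byte store instead of collecting four of them in ebx. It ignores partial writes and treats a read error like EOF.

```nasm
; Assumed build: nasm -f elf32 b64blk.asm && ld -m elf_i386 -o b64blk b64blk.o
INLEN   equ 900                 ; a multiple of 3, so only the last block gets '='
OUTLEN  equ INLEN/3*4           ; 1200

SECTION .bss
    InBuf:  resb INLEN+3        ; +3 spare bytes for the 4-byte load and zero fill
    OutBuf: resb OUTLEN

SECTION .data
    Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

; Store the B64 character for the low 6 bits of eax at [edi+%1], then drop them.
%macro PUT6 1
    mov   edx,eax
    and   edx,0x3F
    mov   dl,[Base64+edx]
    mov   [edi+%1],dl
    shr   eax,6
%endmacro

SECTION .text
global _start
_start:
next_block:
    xor   ebp,ebp               ; ebp = bytes collected in InBuf so far
fill:
    mov   eax,3                 ; sys_read
    xor   ebx,ebx               ; fd 0 = stdin
    lea   ecx,[InBuf+ebp]       ; continue after the bytes already read
    mov   edx,INLEN
    sub   edx,ebp               ; room left in this block
    int   80h
    test  eax,eax
    jle   encode                ; 0 = EOF, negative = error: encode what we have
    add   ebp,eax
    cmp   ebp,INLEN
    jb    fill                  ; short read (pipe, terminal): keep filling
encode:
    test  ebp,ebp
    jz    exit                  ; nothing left to encode
    mov   word [InBuf+ebp],0    ; zero the (up to 2) bytes completing the last group
    mov   esi,InBuf
    mov   edi,OutBuf
    lea   ecx,[InBuf+ebp]       ; ecx = end of the real input
group:
    mov   eax,[esi]             ; 3 input bytes + 1 junk byte
    add   esi,3
    bswap eax                   ; first input byte becomes the MSB
    shr   eax,8                 ; drop the junk byte
    PUT6  3                     ; characters are produced last to first
    PUT6  2
    PUT6  1
    PUT6  0
    add   edi,4
    cmp   esi,ecx
    jb    group
    sub   esi,ecx               ; 0, 1 or 2 = number of fake zero bytes encoded
    jz    write
    mov   byte [edi-1],'='
    dec   esi
    jz    write
    mov   byte [edi-2],'='
write:
    mov   eax,4                 ; sys_write
    mov   ebx,1                 ; fd 1 = stdout
    mov   ecx,OutBuf
    mov   edx,edi
    sub   edx,OutBuf            ; number of characters produced
    int   80h
    cmp   ebp,INLEN
    je    next_block            ; the block was full: there may be more input
exit:
    mov   eax,1                 ; sys_exit
    xor   ebx,ebx
    int   80h
```

With the example from the question, `printf '\174\252\170' | ./b64blk` (the bytes 7C AA 78 in octal) should print fKp4. Because 900 is a multiple of 3, a full block never needs padding, so the '=' fix-up only triggers on the final, shorter block.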

Disclaimer: this code is written to be straightforward, not fast. You asked how to extract 6 bits and how to convert the value into a character, so I think sticking to basic x86 instructions is best.

I'm not even sure how to make it faster without profiling the code for bottlenecks and experimenting with other variants. The partial register usage (bh, bl vs ebx) will probably be costly, so there is very likely a better solution (maybe even a SIMD-optimized version for larger input blocks).

I also didn't debug this code; I just wrote it here in the answer, so proceed with caution and check in a debugger whether and how it works.

Ped7g
  • On modern CPUs, partial registers are probably a good approach, but the optimal strategy would depend on AMD (no partial register renaming) vs. Intel SnB-family (reading EBX after writing BH takes a merging uop that has to issue in a cycle by itself). On Intel, loading into another reg (with `movzx`) and using `shrd` to shift a byte into an accumulator is probably good. – Peter Cordes Dec 11 '17 at 15:47
  • The Base64 lookup table has some big ranges, but with `+` and `/` not even being contiguous with each other, it's probably not a win to SIMD it with branchless compares and blends. But maybe you can build something out of `pshufb` as a 4-bit LUT, and do the other 2 bits somehow? Or just `pshufb` on 4 different tables and blend the result according to the last 2 bits. [AVX512VBMI `vpermb`](https://github.com/HJLebbink/asm-dude/wiki/VPERMB) (Cannonlake) would do the trick, though: 64 byte-lookups in parallel from a 64-byte table. Totally worth it even if it decodes to multiple uops. – Peter Cordes Dec 11 '17 at 15:53
  • SIMD to separate 3-byte chunks into 4-byte groups of 6-bit indices is tricky, too. `pshufb` to group them, and some shifts / blends? – Peter Cordes Dec 11 '17 at 16:00
  • @PeterCordes would have to debug it first to know it works, and resolve ending `=`, before venturing into optimizations. Also I guess I'm not the first entity writing base64 encoder, so there's probably quite some work from others to start upon. Was a bit fun to write it the "naive" way in few minutes, but I would do lot more research if I would want to have it in production. Before that, I would rather hear from OP that 1) it works 2) he understand how/why it works, eventually some questions/suggestions to make this better answer for newcomers. (+I'm nowhere near your SIMD knowledge :) ) – Ped7g Dec 11 '17 at 16:26
  • I'm sure if we went looking at existing implementations there'd be some neat tricks. Probably somebody's come up with something better than a scalar loop, but like you I thought it was more fun to just make something up, since I don't actually need a high-perf Base64 implementation. And BTW, there's a canonical Q&A about how writing partial regs performs on different CPUs: https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers. – Peter Cordes Dec 11 '17 at 16:32
  • To speed up your version with BMI2, you could use `rorx` to copy+rotate (by an immediate) instead of shifting. BSWAP wasn't needed either: copy+shift+mask instead of shift+copy+mask also shortens the dep chain (although each group of 3 bytes is independent). – Peter Cordes Dec 11 '17 at 16:34
  • @PeterCordes: how do you get correct bit order without `bswap` (and without composing intermediate results/masks from different shifts)? I'm probably a bit slow today, but the first character is using b2-b7, second b8-b11:b0-b1, third is b16-b17:b12-b15, fourth b18-b23 ... looks to me like 2 times more shifting and masking without `bswap`. – Ped7g Dec 11 '17 at 16:42
  • Thank you so much !! I'll stick to the "basic instructions" for now. Thank you again! –  Dec 11 '17 at 17:34
  • @Ped7g: ah, you're right about `bswap`, I didn't bother to look up how base64 worked (big endian), or think through the fact that `bswap` would have an effect on grabbing bits across byte boundaries. Oh, and if BMI2 is available, `bswap` + `pext` with 3 different masks is the obvious choice for Intel CPUs. (It's horribly slow on AMD, so RORX + AND there.) – Peter Cordes Dec 13 '17 at 14:26
  • Oh, and with AVX512VBMI (Cannonlake), `VPMULTISHIFTQB` is fantastic. You'd need a byte shuffle like `vpermb` to do a lane-crossing 3->4 expansion and put the bytes you need into each qword in the right order. But then I think [`vpmultishiftqb`](https://github.com/HJLebbink/asm-dude/wiki/VPMULTISHIFTQB) + `vpand` can extract the right bits into each byte from somewhere in the qword that contains it. And then that sets you up for an AVX512VBMI `vpermb` shuffle as a 64-way parallel 64-byte LUT. If `vpermb` takes 3 shuffle uops (like `vpermw` on SKX), then maybe 64 result bytes / 6c – Peter Cordes Dec 13 '17 at 14:33
  • For the record, I searched and found [an interesting AVX512F implementation](http://0x80.pl/articles/avx512-foundation-base64.html). Without byte shuffles (or byte elements for *anything* else, even ADD), it uses a DWORD gather to get 3-byte windows of data and some SWAR tricks to handle within-element stuff. I haven't figured out how they handle the endian swap, or if their data is already big-endian or something; I'm not interested enough to really read it :P Same guy has [SSE and AVX on github](https://github.com/WojciechMula/base64simd), claimed speedup 2 to 4x for encode, 2 to 2.7 decode – Peter Cordes Dec 13 '17 at 14:45