As the question suggests, I have to write a MASM program to convert an integer to binary. I have tried many different approaches, but none of them worked. The code I'm currently working on follows; I get a memory access violation error when I debug it in Visual Studio.

Any help on how to solve the error, and on whether I'm on the right track, would be greatly appreciated. First is my C++ code, which passes a char array to an .asm file to be converted to binary.

#include <iostream>
using namespace std;
extern "C"
{
  int intToBin(char*);
}

int main()
{
  char str[17] = { NULL };
  for (int i = 0; i < 16; i++)
  {
    str[i] = '0';
  }

  cout << "Please enter an integer number :";
  cin >>str;
  intToBin(str);
  cout << " the equivilant binaryis: " << str << endl;
  return 0;
}

and the .asm file is the following:

.686
.model small
.code 

_intToBin PROC       ;name of function
  start:    

    push ebp ; save base pointer
    mov ebp, esp ; establish stack frame

    mov eax, [ebp+8] ; storing char value into eax
    mov ebx, [ebp+12]; address offset of char array
    mov edx,32768 ;storing max 16-bit binary in edx
    mov ecx,17  ; since it's 16-bit, we do the loop 17 times


  nextBite:
    test eax,edx        ;testing if eax is equal to edx
    jz storeZero        ;if it is 0 is to be moved into bl

    mov bl,'1'          ;if not 1 is moved into bl
    jmp storeAscBit     ;then jump to store ascii bit

  storeZero:
    mov bl,'0'          ;moving 0 into bl register

  storeAscBit:
    mov [di ],bl        ;moving bl (either 1 or 0) into [di]
    inc edx             ;increasing edx by 1 to point to the next bit
    shr edx,1           ;shifting right 1 time so the 1 comes to second      
    loop nextBite       ; do the whole step again

  EndifReach:   
    pop ebp
_intToBin ENDP
 END
ArashDe
  • http://stackoverflow.com/q/40769766/4271923 – Ped7g Nov 25 '16 at 19:39
  • Reading through your code, your problem is a bit deeper: not only is the code bugged, you don't even understand the task... oh well. – Ped7g Nov 25 '16 at 19:41
  • First thing... can you show some example of what the wanted input is, and what the correct output is? From your question and code it's not clear which end is which; it's confused in so many places that the original task may equally well have been "take decimal integer from user, output 16b binary format" or "take 16 bit binary format from user, output decimal format"... which one is it? – Ped7g Nov 25 '16 at 19:48
  • Sorry if my explanation is confusing, English is my second language. I have to "take decimal integer from user, output 16b binary format" – ArashDe Nov 25 '16 at 19:56
  • For example: if the input is 33, my output should be this: 0000 0000 0010 0001. One of the problems I have is that no matter what approach I use, I always get a memory access violation error when I do: mov [di ],bl – ArashDe Nov 25 '16 at 19:58
  • Sadly yes, my C++ program should pass a char array and the integer to be converted to an assembly function, and then display the char array which contains the equivalent binary of the input decimal – ArashDe Nov 25 '16 at 20:05
  • I suggest you solve the problem in the higher level language first. I am properly confused as to why you need 17-bit counters and masks when working with 16 bits. Solve the problem with `unsigned short` or `uint16_t`, then translate to assembler. – Weather Vane Nov 25 '16 at 20:05
  • I think the 17-bit container is saving one space for negative numbers – ArashDe Nov 25 '16 at 20:07
  • There is no restriction on using functions such as atoi, but I have never used atoi, so I don't know how it works with MASM – ArashDe Nov 25 '16 at 20:08
  • A negative 16-bit number still requires 16 bits. A binary pattern is what it is, regardless of what it *represents*. – Weather Vane Nov 25 '16 at 20:08
  • So I guess my teacher is wrong about using 17 bits in the char array – ArashDe Nov 25 '16 at 20:11
  • He probably reserved one byte for the null terminator, so the result will be a valid C "string". – Ped7g Nov 25 '16 at 20:12
  • I think that may be the reason he chose 17 – ArashDe Nov 25 '16 at 20:28
  • OK, I'm going to try to write it. Is it possible to convert the string to a number and then convert the number to binary in one .asm file? – ArashDe Nov 25 '16 at 20:35
  • No, you convert the string to a number with `atoi`, then send the number to assembly to convert to binary. – Jose Manuel Abarca Rodríguez Nov 25 '16 at 20:36
  • About ` mov [di ],bl ` crashing... of course: you don't set `edi` to any value, and you also use only 16 bits of `edi` to address memory while you are in 32-bit mode, so it very likely tried to modify memory protected by the OS. Also you read something from `[ebp+12]`, while your function has only one argument, etc. But your C++ code is already wrong too, so maybe you should first try to write the whole thing in C++, and make sure you understand what is what and why. Then get back to asm, slowly adding parts of it, debugging each of them to verify it works as expected (there are **many** bugs in that asm) – Ped7g Nov 25 '16 at 20:41

3 Answers


This is a high-level answer to explain some terms.

Part 1 - about integer numbers and their encoding in a computer

An integer value is an integer value; in math it's a purely abstract thing. The number 5 is not what you see on the monitor: that's the digit 5 (a graphical image, or "glyph") representing the value 5 in base-10 (decimal) format for humans (and some trained animals) who can recognize that glyph pattern; the value 5 itself is purely abstract.

When you use int in C++, it's not completely abstract; it's a lot more hard-wired into the metal: a 32-bit (on most platforms) integer value.

But still, that abstract description is much closer to the truth than imagining it in its human decimal format.

int a = 12345; // decimal number

Here a contains the value 12345, not the format. It's not aware it was entered as a decimal string in the source code.

int a = 0x3039; // hexadecimal number

will compile into exactly the same machine code; for the CPU it's the same thing, still (a == 12345). And finally:

int a = 0b0011000000111001; // binary number

is again the same thing. It's still the same 12345 value, just written in a different format.
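A quick C++ check of this (my illustration, not part of the original answer; the 0b literal needs C++14):

#include <cassert>

int main()
{
    int a = 12345;               // decimal literal
    int b = 0x3039;              // hexadecimal literal
    int c = 0b0011000000111001;  // binary literal (C++14)
    assert(a == b && b == c);    // one value, three spellings
    return 0;
}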

The last, binary form is closest to what the CPU uses to store the value. It is stored in 32 bits (low/high voltage cells/wires), so if you measured the voltage on a particular cell/wire, you would see the "0" voltage level on the top 18 bits, then 2 bits at the "1" voltage level, and then the rest as in that binary format above, with the two least significant bits being "0" and "1".

Also, most CPU circuitry is not aware of the particular value of a particular bit; that's again an "interpretation" of that 0/1, done by the code. Many CPU algorithms like add or sub work "from right to left" over all bits, not being aware that the currently processed bit represents, say, 2^13 = 8192 in the final integer value (that's the 14th least significant bit).

It's only upon taking those bits and calculating a string with the decimal/hexadecimal/binary representation of those bit values that you give those "1"s their value. Only then does it become the text "12345".

If you treat those 32 bits in a different way, say as ON/OFF states for the lights of an LED display panel, then that's what they will be: once you send them from the CPU to the display, the LED panel will turn on the corresponding lights, not caring that those bits also form the value 12345 when treated as an int.

Only a very few CPU instructions work in a way that requires awareness of the particular value of a particular bit.

Part 2 - about input, output and arguments of C/C++ functions

You want to "convert decimal integer (input) to binary."

So let's reason about what the input is and what the output is. Input is taken from std::cin, so the user will enter a string.

Yet if you do:

int inputNum;
std::cin >> inputNum;

You will end up with an already-converted integer value (32 bits, see above), or with an invalid std::cin state when the user doesn't enter a correct number (probably not your task to handle that).

If you have the number in an int, the binary conversion was already done by the C library when it encoded the user's input string as a 32-bit integer.

Now you can create asm function with C prototype:

void formatToBinary(uint16_t value, char result[17]);

That means you will give it a uint16_t (unsigned 16-bit) integer value and a pointer to 17 reserved bytes in memory, where you will write the '0' and '1' ASCII characters and terminate them with another 0 value (for a rough description of this, follow my first link in the comments under your question).
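As a reference point before writing the asm, here is a minimal C++ sketch of what that contract means (my illustration, not code from the original answer):

#include <cstdint>

// Write the 16 bits of 'value' as ASCII '0'/'1' characters, MSB first,
// then terminate with a 0 byte so the buffer is a valid C string.
void formatToBinary(uint16_t value, char result[17])
{
    for (int i = 0; i < 16; ++i)
        result[i] = '0' + ((value >> (15 - i)) & 1); // isolate one bit
    result[16] = '\0';
}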

If you must take the input as a string, i.e.:

char str[17];
std::cin >> str;

Then you will have in str (after "12345" input) bytes with the values '1' (49 in decimal), '2', '3', '4', '5', 0. (Note the last one is the value zero, NOT the ASCII digit '0' = value 48.)

You will first need to convert these ASCII bytes into an integer value (in C++, atoi may help, or one of a few other conversion/formatting functions; a sketch of the idea follows). In ASM, check SO for questions like "how to enter integer".
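To illustrate what atoi does internally, here is a hand-rolled digit parser in C++ (my sketch; parseDecimal is a hypothetical name, and it assumes a valid non-negative decimal string with no error handling):

#include <cstdint>

// Parse ASCII decimal digits into an integer, stopping at the 0 terminator.
uint16_t parseDecimal(const char *s)
{
    uint16_t value = 0;
    while (*s)
        value = value * 10 + (*s++ - '0'); // e.g. '5' (53) - '0' (48) = 5
    return value;
}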

Once you convert it to an integer value, you can proceed the same way as described a bit above (at that moment it's already encoded in 16 or 32 bits, so outputting a string representation of it should be easy).

You may still run into some tricky parts, like suppressing leading zeroes, etc., but all of that should be easy if you understand how this works.

In this case your ASM function prototype may be just void convertToBinary(char*); reusing the string pointer both as input and as output.

Your int intToBin(char*); looks weird, because it means the ASM will return an int... but why? That's an integer value, not bound to any particular formatting, so it's binary/octal/decimal/hex at the same time; it depends on how you display it. So you don't need it; you need only the string representing the value in binary form, and that's the char*. And you don't give the function the number you entered (unless it's supposed to take it from the string).


From the task description and your skill level, I think you are allowed to convert the input into an int right in C++ (i.e. std::cin >> int_variable;).


BTW, if you fully understand what happens to values in a computer, and how CPU instructions work on them, you can often come up with many different ways to achieve some result. For example, Jose's conversion to binary is written in the simple way an Assembly newcomer would write it (he wrote it like that to make it easier for you to understand):

        mov eax, num   // ◄■■ THE NUMBER.
        lea edi, bin   // ◄■■ POINT TO VARIABLE "BIN".
        mov ecx, 32    // ◄■■ NUMBER IS 32 BITS.
    conversion:
        shl eax, 1     // ◄■■ GET LEFTMOST BIT.
        jc  bit1       // ◄■■ IF EXTRACTED BIT == 1
        mov byte ptr [edi], '0'
        jmp skip
    bit1:
        mov byte ptr [edi], '1'
    skip:
        inc edi        // ◄■■ NEXT POSITION IN "BIN".
        loop conversion

It's still a bit fragile: for example, he initializes "bin" in such a way that it contains 32 spaces and the 33rd value is zero (the null terminator of a C string). The code then modifies exactly 32 bytes, so the 33rd byte, the zero, is still there and working. If you adjusted his code to skip leading zeroes, it would "break" by displaying the remaining part of the buffer, as he doesn't set the null terminator explicitly.

This is a common way to code in Assembly for performance: being exactly aware of everything that happens, and not setting values which are already set, etc. While you are learning, I would suggest you work in a "defensive" way, preferring somewhat wasteful things which act as a safety net in case of a mistake; so I would add mov byte ptr [edi],0 after the loop to set the terminator explicitly again.

But it is actually not very fast, as it uses branching. The CPU doesn't like that: decoding new instructions is a costly task, and if it is not sure which instructions will be executed, it simply decodes ahead along one path; in case of a wrong guess it throws that work out and decodes the correct path, but that means a pause of several cycles in execution until the first instruction of the new path is fully decoded and ready.

So when coding for performance, you want to avoid hard-to-predict branches (the final loop is easy for the CPU to predict, as it always loops, up until the single final exit once ecx is 0). One of many possible ways in this case:

   mov edx, num
   lea edi, bin
   mov ah,'0'/2   // for fast init of al later
   // '0' is 48 (even), '0'/2 will work (24)
   mov ecx, 32    // countdown counter
conversion:
   mov al,ah      // al = '0'/2
   shl edx, 1     // most significant bit into CF
   adc al,al      // al = '0'/2 + '0'/2 + CF = '0' or '1'
   stosb          // store the '0' or '1' to [edi++]
   dec ecx        // manually written "loop"
   jnz conversion // (it is faster on modern CPUs)
   mov [edi],ch   // explicit set of null-terminator
       // (ch == 0, because here ecx == 0)

As you can see, now there is no branching except the loop itself; the CPU's branch prediction will handle this much more smoothly, and the performance will be considerably better.
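In C++ terms, the per-character trick boils down to the following (my paraphrase; toBinaryBranchless is a hypothetical name). Since '0' is 48, an even number, '0'/2 + '0'/2 + bit is 48 + bit, i.e. '0' or '1', with no branch:

#include <cstdint>

// Branchless conversion, mirroring the asm: SHL pushes the top bit out,
// and ADC al,al computes '0'/2 + '0'/2 + CF.
void toBinaryBranchless(uint32_t num, char bin[33])
{
    for (int i = 0; i < 32; ++i) {
        unsigned bit = num >> 31;           // the bit SHL would move into CF
        num <<= 1;
        bin[i] = char('0' / 2 + '0' / 2 + bit);
    }
    bin[32] = '\0';                         // explicit terminator
}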


A dword variant for discussion with Cody (NASM syntax, 32b target):

; .data
binNumber   times 36 db 0

; .text
numberToBin:
    mov     edx,0x12345678
    lea     edi,[binNumber]
    mov     ecx, 32/4       ; countdown counter
n2b_conversion:
    mov     eax,0b11000000110000001100000011000
      ; ^ = 0x18181818 = '0'/2 in each byte; with the four bits
      ;   rotated in below, each byte will become '0' or '1'
    shl     edx,1
    rcr     eax,8
    shl     edx,1
    rcr     eax,8
    shl     edx,1
    rcr     eax,8
    shl     edx,1
    rcr     eax,8
      ; here was "or eax,'0000'" => no more needed.
    stosd
    dec     ecx
    jnz     n2b_conversion
    mov     [edi],dl        ; null terminator
    ret

Didn't profile it, just verified that it returns the correct result.
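If the rotate-through-carry dance is hard to visualize, here is a small C++ model of the same algorithm (my addition, not part of the original answer; it emulates CF:EAX as one 33-bit value and relies on x86 being little-endian for the memcpy):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t edx = 0x12345678;
    char bin[33];
    for (int group = 0; group < 8; ++group) {     // 32/4 iterations
        uint64_t cf_eax = 0x18181818;             // the binary constant above
        for (int k = 0; k < 4; ++k) {
            uint64_t cf = edx >> 31;              // shl edx,1 sets CF
            edx <<= 1;
            cf_eax = (cf_eax & 0xFFFFFFFFu) | (cf << 32);
            // rcr eax,8 rotates the 33-bit CF:EAX right by 8
            cf_eax = ((cf_eax >> 8) | (cf_eax << 25)) & ((1ULL << 33) - 1);
        }
        uint32_t eax = (uint32_t)cf_eax;          // four ASCII '0'/'1' bytes
        memcpy(bin + group * 4, &eax, 4);         // stosd
    }
    bin[32] = '\0';
    puts(bin);   // prints 00010010001101000101011001111000
}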

Ped7g
  • 15,245
  • 2
  • 24
  • 50
  • Thanks man, greatly appreciated. I will return to coding and try to fix it as you explained to me – ArashDe Nov 25 '16 at 21:25
  • @ArashDe I added an example of how differently things can be done in Assembly when you fully understand the data and the instructions (I modified Jose's loop to show you some other concepts). If you don't understand something, try to re-read it a few more times, then ask. Or let me know if everything was clear and you understood it all (I'm worried it's not that simple to follow me :) ). – Ped7g Nov 25 '16 at 23:03
  • Thank you for explaining clearly and thoroughly, I understand everything way better now. I like how you used stosb, I had no idea I could have used that – ArashDe Nov 26 '16 at 03:45
  • It is rather ironic that you say "coding for performance", and then list code that uses `STOSB`. Without a REP prefix, STOSB is almost completely useless unless you're targeting the 8088, where the extremely slow memory bus and prefetcher actually mean that a 1-byte instruction is a performance win. Also, the use of `ah` and `al` is going to result in stalls on many processors. Some rename these separately, but most don't. – Cody Gray Nov 26 '16 at 12:35
  • @CodyGray can you explain that for me? From what I can find on the Internet, `stosb` looks quite on par with `mov`, at least when used in this code, padded around by other instructions. Or do you mind the per-byte operation, and a `stosd` variant of the code would be fine with you? I can add that one for fun to the answer, but I'm not sure what's wrong with string instructions on x86, since on recent architectures they shouldn't pose any real penalty (and they were good not only on the 8088, but still on the 80386-486, when I actually *did* profile my code). But I don't profile these short examples for SO :/ – Ped7g Nov 26 '16 at 14:16
  • @ArashDe the new variant may be yet another example for you of how differently the same result can be achieved :) - if you are aware of what you want, and of what the instructions do. I did try to use another set of instructions; for example the `or` can be replaced (in this special case, not generally!) by `add` or `adc`, but I already used `adc` in the previous one. – Ped7g Nov 26 '16 at 14:26
  • Hmm, no, if you're only moving a byte, then it would be silly to use a DWORD-sized instruction. There is of course no way that a bunch of shifts and rotates could ever be faster. I mean just what I said, that I have never found the string instructions to be a win unless you're using them in conjunction with the REP prefix, and even then, only certain string instructions offer reasonable performance. I don't know what you're seeing on the Internet. I can't find anything that suggests STOSB might be faster than storing and incrementing, and it's inconsistent with my recollections. – Cody Gray Nov 26 '16 at 14:35
  • @CodyGray you are missing the point of the dword variant: it processes 4 bits in one iteration, writing for example `"0100"` into the buffer with a single `stosd`. Of course I'm not storing a single character with `stosd`, omg. :) ... Nor am I claiming `stos` is faster; I'm saying they are on par, when used sparsely. – Ped7g Nov 26 '16 at 15:02
  • @Cody is correct. STOSB is 3 fused-domain uops on Intel SnB-family, but `mov [edi], al` / `inc edi` is only 2 uops, so STOSB is definitely worse, except for code-size. According to [Agner Fog's tables](http://agner.org/optimize), it's still only 3 unfused-domain uops on SnB-family, so I guess it's still only one ALU uop (even though it needs DF as an extra input), but the store doesn't micro-fuse. (Stores are decoded as separate store-address and store-data uops, unlike loads). STOS is 4 m-ops on AMD K10, but 3 on BD-family, so it's worse there, too. – Peter Cordes Nov 27 '16 at 03:26
  • Interestingly, LODSD/Q are efficient on Haswell and later, but LODSB/W are still one extra uop vs. the simple-instruction equivalent. – Peter Cordes Nov 27 '16 at 03:27
  • RCR with count > 1 is *very* slow: 8 uops on Haswell, 6c latency. Even worse on AMD: 15 m-ops with 7c latency on Piledriver. Your STOSD version is unfortunately horrible :( – Peter Cordes Nov 27 '16 at 03:31
  • The best idea I came up with is `shl edx, 1` / `adc eax, '0'` (the carry turns this into '0' or '1') / `shl eax, 8`, repeated 4 times. (Leave out the `shl eax,8` on the last, of course.) ADC is 2 uops on CPUs before Broadwell, but I don't think you're going to beat it with SETC. You could use SETC with memory destinations, and then use a DWORD memory-destination OR to convert to ASCII, but that's pretty bad (store-forwarding stall, and tons of store instructions). Actually, on Haswell (which doesn't have partial-register stalls) SETC AL / SETC AH and one `shl eax, 16` should be good. – Peter Cordes Nov 27 '16 at 04:05
  • Maybe SHLD to shift a bit from the top of EDX into the bottom of EAX. That doesn't modify EDX so you still need to SHL it separately. And you need to `shl eax, 7` :/ Still, it's better than ADC on Intel SnB-family, where SHLD is only 1 uop. – Peter Cordes Nov 27 '16 at 04:09
  • Of course, BMI2 [`pdep eax, edx, mask_constant`](http://www.felixcloutier.com/x86/PDEP.html) can scatter the low 4 bits of EDX into the 4 bytes of EAX in one instruction, then `or eax, "0000"`. In 64-bit mode, this can go 8 bits/bytes at a time. (See http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask for an example of using PDEP this way). – Peter Cordes Nov 27 '16 at 04:13
  • But you can do even better with SSE2, basically [do the inverse of PMOVMSKB, which takes multiple instructions but no loop](http://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb), then OR the whole thing to turn it into a 16-byte ASCII string. – Peter Cordes Nov 27 '16 at 04:17
  • @PeterCordes thanks for the info. :) Jeez, so `rcr` is slow... The compilers annoy me a lot here: once they don't emit some instruction, it's pretty much dead. :/ (although I can see it may be somewhat difficult to keep that one running on a modern architecture) On the other hand, completely insane instructions like `imul` are now viable... :D – Ped7g Nov 27 '16 at 04:45
  • @CodyGray and Ped7g: I posted the SSE2 version as an answer, since it's pretty clean. Once you have the input bytes broadcast where you want them, it's really only three instructions to turn them into ASCII digits. – Peter Cordes Nov 27 '16 at 05:51

For the decimal-string -> integer part, see NASM Assembly convert input to integer?


You can do the whole thing without any loops, using SIMD to do all the bits in parallel.

Also related:

  • int -> hex string including scalar and SIMD.

  • int -> decimal string (or other non-power-of-2 bases)

  • How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD for a neat SIMD version that's efficient when you want to stop with 0/1 integers instead of ASCII digits. It's specifically optimized for 8 bits -> 8 bytes.

  • Convert 16 bits mask to 16 bytes mask - intrinsics version of this answer; includes a version that converts to ASCII '0' / '1' as well as just 0 / 1 bytes, with SSE2 / SSSE3, and AVX-512.

  • How to create a byte out of 8 bool values (and vice versa)? shows a trick using a 64-bit multiply constant. In 32-bit mode, it takes two multiplies per 8 bits, producing the low and high 4 bits separately (from imul reg, src, 0x08040201 and 0x80402010). For each 4 bytes of output / 4 bits of input, you need and + shr, and to convert to ASCII also add reg, '0000'. But at least you don't have to extract each 4 bits of the input separately, only 8 bits with movzx and use the two halves of 0x8040201008040201.

    That's a lot of instructions in total, but better than 1 bit at a time if you don't have SSE2 but imul isn't too slow (e.g. Pentium 3, or modern CPUs if you don't want to depend on SSE2). 64-bit mode guarantees that SSE2 is available, or you could use this scalar multiply trick to go 8 bits -> 8 bytes at a time; see the sketch just below.
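To make the 64-bit variant of that multiply trick concrete, here is a C++ sketch (my code, not from the linked answer; bitsToAscii8 is a hypothetical name). The multiply places bit (7-i) of the input at the top of byte i, the AND isolates those bits, the shift moves each to the bottom of its byte, and adding '0' to every byte yields the ASCII digits, MSB first on a little-endian machine:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Expand 8 input bits into 8 ASCII '0'/'1' bytes with one 64-bit multiply.
void bitsToAscii8(uint8_t x, char out[9])
{
    uint64_t v = (x * 0x8040201008040201ULL) & 0x8080808080808080ULL;
    v = (v >> 7) + 0x3030303030303030ULL;  // 0x30 = '0' in every byte
    memcpy(out, &v, 8);                    // little-endian: MSB's digit first
    out[8] = '\0';
}

int main()
{
    char buf[9];
    bitsToAscii8(0xA5, buf);  // 0xA5 = 10100101
    puts(buf);                // prints 10100101
}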


The integer -> base 2 string part is simpler than the base10 string->int, or at least can be done efficiently with a few SSE2 instructions and no loop.

This uses the same technique as Evgeny Kluev's answer on a question about doing the inverse of PMOVMSKB, to turn a bit-pattern into a vector of 0 / -1 elements: broadcast-shuffle the input bytes so every vector element contains the bit you want (plus neighbours). AND that with a mask so each element is zero or non-zero depending on the one bit you kept, then compare against an all-zero vector.

This version only requires SSE2, so it works on every CPU that can run a 64-bit OS, and on some 32-bit-only CPUs (like early Pentium 4 and Pentium M). It can go faster with SSSE3 (one PSHUFB instead of three shuffles to get the low and high bytes where we want them). You could do 8 bits -> 8 bytes at a time with MMX.

I'm not going to try to convert it from NASM to MASM syntax. I have actually tested this, and it works. The x86 32-bit System V calling convention doesn't differ from the 32-bit Windows cdecl calling convention in any ways that affect this code, AFAIK.

;;; Tested and works

;; nasm syntax, 32-bit System V (or Windows cdecl) calling convention:
;;;; char *numberToBin(uint16_t num, char buf[17]);
;; returns buf.

ALIGN 16
global numberToBin
numberToBin:
        movd    xmm0, [esp+4]       ; 32-bit load even though we only care about the low 16 bits.
        mov     eax, [esp+8]        ; buffer pointer

        ; to print left-to-right, we need the high bit to go in the first (low) byte
        punpcklbw xmm0, xmm0              ; llhh      (from low to high byte elements)
        pshuflw   xmm0, xmm0, 00000101b   ; hhhhllll
        punpckldq xmm0, xmm0              ; hhhhhhhhllllllll

        ; or with SSSE3:
        ; pshufb  xmm0, [shuf_broadcast_hi_lo]  ; SSSE3

        pand    xmm0, [bitmask]     ; each input bit is now isolated within the corresponding output byte
        ; compare it against zero
        pxor    xmm1,xmm1
        pcmpeqb xmm0, xmm1          ; -1 in elements that are 0,   0 in elements with any non-zero bit.

        paddb   xmm0, [ascii_ones]  ; '1' + (-1 or 0) = '0' or '1'

        mov     byte [eax+16], 0    ; terminating zero
        movups  [eax], xmm0
        ret


section .rodata
ALIGN 16

;; only used for SSSE3
shuf_broadcast_hi_lo:
        db 1,1,1,1, 1,1,1,1     ; broadcast the second 8 bits to the first 8 bytes
        db 0,0,0,0, 0,0,0,0     ; broadcast the first 8 bits to the second 8 bytes

bitmask:  ; select the relevant bit within each byte, from high to low for printing
        db 1<<7,  1<<6, 1<<5, 1<<4
        db 1<<3,  1<<2, 1<<1, 1<<0
        db 1<<7,  1<<6, 1<<5, 1<<4
        db 1<<3,  1<<2, 1<<1, 1<<0

ascii_ones:
        times 16 db '1'

Using PSHUFLW to do the reversal in the second shuffle step is faster on old CPUs (first-gen Core 2 and older) that have slow 128b shuffles, because shuffling only the low 64 bits is fast (compared to using PUNPCKLWD / PSHUFD). See Agner Fog's Optimizing Assembly guide to learn more about writing efficient asm, and other links in the tag wiki.

(Thanks to clang for spotting the possibility).

If you were using this in a loop, you'd load the vector constants into vector registers instead of re-loading them every time.
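If you want the same thing from C++, here is an intrinsics translation of the asm above (my sketch, untested, unlike the tested asm; it follows the same prototype):

#include <emmintrin.h>   // SSE2
#include <cstdint>

void numberToBin(uint16_t num, char buf[17])
{
    __m128i v = _mm_cvtsi32_si128(num);      // 16 bits in the low word
    v = _mm_unpacklo_epi8(v, v);             // llhh
    v = _mm_shufflelo_epi16(v, 0b00000101);  // hhhhllll
    v = _mm_unpacklo_epi32(v, v);            // hhhhhhhhllllllll

    const __m128i bitmask = _mm_set_epi8(    // args run from byte 15 to byte 0
        1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1<<7),
        1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1<<7));
    v = _mm_and_si128(v, bitmask);              // isolate one bit per byte
    v = _mm_cmpeq_epi8(v, _mm_setzero_si128()); // -1 where the bit was 0
    v = _mm_add_epi8(v, _mm_set1_epi8('1'));    // '1' + (-1 or 0) = '0' or '1'

    _mm_storeu_si128((__m128i*)buf, v);
    buf[16] = '\0';
}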


From asm, you can call it like

    sub     esp, 32

    push    esp           ; buf[] on the stack
    push    0xfba9        ; use a constant num for example
    call    numberToBin
    add     esp, 8
    ;; esp is now pointing at the string

Or call it from C or C++ with the prototype from comments in the asm.

Peter Cordes
  • I would love to see the face of the lecturer if some student brought this one in... :D ... to be on-topic: why is the null terminator written ahead of the string itself, any particular reason? I thought it's better to write memory in consecutive order, when possible. ... off-topic: I hope one day somebody extremely bored will profile all of these, to give those "uops differences" some real "time" value, so I can better imagine it. It looks a bit funny to hunt these down, when I consider what kind of code is then run in production where I work ("Java" should be enough to explain). – Ped7g Nov 27 '16 at 12:03
  • @Ped7g: Other than stalls for memory or branch mispredicts, there are basically three dimensions to code performance: front-end throughput (up to 4 fused-domain uops per clock, plus decoder effects), throughput of specific ports (unfused-domain), and latency of loop-carried dependency chains. So in code that's bottlenecked on fused-domain uop throughput, more fused-domain uops cause a somewhat-predictable slowdown. In other code, it might have no effect, or it might have indirect effects from letting out-of-order execution see less far ahead, or from more uop-cache misses. – Peter Cordes Nov 27 '16 at 12:10
  • @Ped7g: Good question: The 0 byte is stored first so it can complete before the dependency chain that produces the vector is done. I put it after the loads because stores with unknown addresses make life difficult for [memory disambiguation](https://en.wikipedia.org/wiki/Memory_disambiguation). Doing it earlier might possibly delay loading the vector constants on some CPUs. Re: ascending order: I don't need the prefetcher to kick in and do anything. I'm only storing these 17 bytes. The terminator is probably in the same cache line as the vector, so either one heats up cache for the other – Peter Cordes Nov 27 '16 at 12:18
  • And yeah, the idea of someone handing this in as their homework amused me, too, while I was writing it on this question :) – Peter Cordes Nov 27 '16 at 12:19
  • @Ped7g: Storing in descending sequential order is just as good as ascending, for long runs of loads or stores (HW prefetch works). I've never tried, but I think you might get good results from one function writing an array in ascending order, and the next function reading it in descending order. The first read would have a dependency on the last write, so no OOO overlap, but for medium and large arrays, the cache effects would be nice. The first ~32k of reads would be hot in L1D cache, and same for L2 / L3 sizes. But if you read in the same order, the front of the array may already be cold. – Peter Cordes Nov 27 '16 at 12:26
  • @Ped7g: just remembered there was an SSE `atoi()` question earlier this year, with a nice answer using SSE4.2. It's hilariously complex, but just for good measure, I linked it from my answer, to show that the whole thing could be done without looping. :) – Peter Cordes Nov 27 '16 at 12:41

Next is an example of using "atoi" to convert the string to a number, then using assembly to convert the number to binary:

#include "stdafx.h"
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{   char str[6]; // ◄■■ NUMBER IN STRING FORMAT.
    int num;    // ◄■■ NUMBER IN NUMERIC FORMAT.
    char bin[33] = "                                "; // ◄■■ BUFFER FOR ONES AND ZEROES.
    cout << "Enter a number: ";
    cin >> str;  // ◄■■ CAPTURE NUMBER AS STRING.
    num = atoi(str); // ◄■■ CONVERT STRING TO NUMBER.
    __asm { 
           mov eax, num   // ◄■■ THE NUMBER.
           lea edi, bin   // ◄■■ POINT TO VARIABLE "BIN".
           mov ecx, 32    // ◄■■ NUMBER IS 32 BITS.
        conversion:
            shl eax, 1     // ◄■■ GET LEFTMOST BIT.
            jc  bit1       // ◄■■ IF EXTRACTED BIT == 1
            mov byte ptr [edi], '0'
            jmp skip
        bit1:
            mov byte ptr [edi], '1'
        skip:
            inc edi   // ◄■■ NEXT POSITION IN "BIN".
            loop conversion
    }
    cout << bin;
    return 0;
}
  • I completely understand your point in using atoi; this way it is way easier to work with assembly, since the input is not a string anymore. Thank you very much :D – ArashDe Nov 25 '16 at 22:01
  • If you're going to write the code in C anyway, I can't see any advantage of using inline assembly. You aren't doing anything that can't be done with straightforward C code, and the inline assembly is inevitably going to be slower than letting the compiler generate code. (However, writing a C program that calls `atoi`, disassembling it, and examining the assembly code that the compiler generates for the operation would certainly be instructive!) – Cody Gray Nov 26 '16 at 12:29
  • @CodyGray: well, our instructor's purpose is to get us familiar with assembly language. Even though using asm instead of C is not optimal in this case, as you mentioned, I now have the general idea of what to do when I write new code with separate .cpp and .asm files. – ArashDe Nov 26 '16 at 18:21