50

I remember reading somewhere that to really optimize & speed up certain sections of their code, programmers write those sections in Assembly language. My questions are:

  1. Is this practice still done? and How does one do this?
  2. Isn't writing in Assembly Language a bit too cumbersome & archaic?
  3. When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?

I am trying to understand this concept & any help or links are much appreciated.

UPDATE: Rephrasing point 3 as requested by dbemerlin: Because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembler expert your code will probably run slower, because the compiler often optimizes the code better than most humans can.

Ryan Tenney
Srikar Appalaraju

14 Answers

32

The only time it's useful to revert to assembly language is when

  • the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)

    OR

  • for some inexplicable reason - the optimiser is failing to use the best CPU instructions

...AND...

  • the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.

Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:

  • the compiler knows how to do this equally well
    • to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
  • you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
    • compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everything through specific registers would serialise it
      • in fairness, GCC has ways to express needs for specific types of registers without constraining the compiler to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this
  • if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whoever's maintaining the software then is)
  • the compiler will select an optimal approach for the target architecture it's told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
  • assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
  • other programmers may not know or be comfortable with assembly
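
To make the GCC point above concrete, here is a minimal sketch of GNU C extended asm with register-class constraints. The function name `add_u32` is hypothetical, the asm path assumes GCC or Clang on x86 (AT&T syntax), and other targets fall back to plain C. It is purely an illustration of the constraint syntax: for an addition, plain `a + b` is what you should actually write, for all the reasons listed.

```c
#include <stdint.h>

/* Hypothetical helper: adds b to a.
   On x86 with GCC or Clang this wraps a single `add` instruction;
   on any other compiler or target it falls back to plain C. */
uint32_t add_u32(uint32_t a, uint32_t b)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* "+r"(a): a is read and written, in any general-purpose register.
       "r"(b):  b is an input, in any general-purpose register.
       "cc":    the instruction modifies the flags.
       The compiler picks the actual registers, so it keeps its freedom
       in register allocation around the statement. */
    __asm__("add %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;
#endif
}
```

Because the constraints only say "some register", the compiler can place `a` and `b` wherever suits the surrounding code instead of shuffling values into hardcoded registers.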

One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.

Tony Delroy
15

First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.
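
As a sketch of the kind of plain-C fix that profiling tends to surface (the function names are hypothetical), consider a loop that calls `strlen()` on every iteration even though the string never changes. Hoisting the call into a local variable is usually all it takes, with no assembler involved:

```c
#include <stddef.h>
#include <string.h>

/* Slow version: strlen() is re-evaluated in the loop condition, so the
   loop can be quadratic in the string length unless the compiler manages
   to hoist the call by itself. */
size_t count_spaces_slow(const char *s)
{
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); ++i) {
        if (s[i] == ' ')
            ++count;
    }
    return count;
}

/* Faster version: the length is computed once and kept in a local. */
size_t count_spaces_fast(const char *s)
{
    size_t count = 0;
    const size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i) {
        if (s[i] == ' ')
            ++count;
    }
    return count;
}
```

Both functions return the same result; only the amount of work per iteration differs.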

Community
sharptooth
  • 1
    profile my program? You mean this would help me decide if I want to use Assembly? – Srikar Appalaraju Nov 17 '10 at 08:38
  • 2
    @MovieYoda: No, it helps you figure out where the bottleneck is. That way, you don't waste your time trying to optimize a piece of code that isn't even a major factor in performance. Generally, writing assembly in C or C++ code should be done only as a very last resort. Often, just using different algorithms or data structures will speed up code. – In silico Nov 17 '10 at 08:39
  • Yes, as it will tell you where your program is spending most of its time and would benefit from optimisation. However, you should first look to see whether your code would benefit from a better algorithm rather than brute-force assembler. – graham.reeds Nov 17 '10 at 08:39
  • 4
    @MovieYoda: Yes, you might find such dumb pieces of code that just rewriting them (still in C or C++) will give a tremendous boost. For example, if you call `strlen()` in a loop while the string length doesn't change, rewriting that in assembler is a waste of time - you just use a temporary variable to store the length and (magic!) your program likely runs noticeably faster. – sharptooth Nov 17 '10 at 08:40
  • 3
    @MovieYoda: Here's a piece I did (http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773) showing how to find code that is actually worth optimizing, and cycle-squeezing (like writing asm) is almost never what is needed. – Mike Dunlavey Nov 17 '10 at 14:23
11

(1) Yes. The easiest way to try this out is to use inline assembly. The syntax is compiler dependent, but it usually looks something like this (MSVC style shown):

__asm
{
    mov eax, ebx
}

(2) This is highly subjective

(3) Because you might be able to write more effective assembly code than the compiler generates.

Andreas Brinck
  • cool! So this practice is called 'inline assembly'! Nice... So basically, this practice is severely hardware & platform dependent? because each hardware & platform have small variations in their instruction set? – Srikar Appalaraju Nov 17 '10 at 08:46
  • 6
    You might want to change (3) to `Because you might be able to write more effective assembly code than the compiler generates` but unless you are an assembler expert your code will probably run slower because often the compiler optimizes the code better than most humans can. – Morfildur Nov 17 '10 at 08:47
  • 2
    I think "might" covers it, I don't think you can be more quantitative than that. – Andreas Brinck Nov 17 '10 at 08:51
  • 2
    I disagree with (1). The easiest way is usually with 'out of line' assembly source files. This way you get proper syntax highlighting and can use an assembler designed for humans with useful features such as more powerful macros. I usually recommend yasm. – CB Bailey Nov 17 '10 at 08:55
  • @Charles I meant easy as in easy to try out. I agree that if you're going to do a lot of assembly coding you're better off with an external assembler. – Andreas Brinck Nov 17 '10 at 09:04
  • 3
    @dbemerlin you do not need to be an expert to optimize compiler generated code. You just need to find the right spot, and know something that the compiler does not take into consideration. Looking at the generated code is the best way. Often you will find that the compiler adds safeguards where no such safeguarding is necessary. Skipping one load in the core of a loop might do marvels at the right spot in the code. – daramarak Nov 17 '10 at 12:33
  • MSVC inline assembly only works on 32-bit x86, so it's pretty bad and pretty much obsolete. GNU C inline assembly might be a better example because it *can* efficiently wrap a single instruction without forcing the compiler to bounce the input through memory. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. and https://stackoverflow.com/tags/inline-assembly/info. Of course https://gcc.gnu.org/wiki/DontUseInlineAsm if you can get the compiler to produce as good asm (for current CPUs) using intrinsics or pure C; that will be more future proof. – Peter Cordes Jul 02 '19 at 07:42
  • The main reason for knowing asm is to tweak your C to compile more efficiently, or to *be* a compiler developer. Not to actually *write* asm. – Peter Cordes Jul 02 '19 at 07:42
  • The easiest way is definitely not inline assembly. Firstly, the syntax varies from compiler to compiler. Secondly, your code is inserted into surrounding compiler-generated code where you have no clue what registers you are allowed to use (or else you have to learn a special language for using compiler-allocated registers, like with GCC). By far the easiest way to use assembly from C is to write separate functions in an assembly language file, interfacing with C using the platform's documented calling conventions. – Kaz Jul 02 '19 at 22:35
6

You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.

In short, the first book explains how to push assembly programming to its limits. In the followup he explains that programmers should rather use a higher level language like C and only try to optimize very specific spots using assembly, if necessary at all.

One motivation for this change of mind was that he saw that programs highly optimized for one generation of processor could become (somewhat) slow on the next generation of the same processor family compared to code compiled from a high level language (perhaps because the compiler uses new instructions, or because the performance and behavior of existing ones change from one processor generation to another).

Another reason is that compilers are quite good and optimize aggressively nowadays, so there is usually much more performance to gain from working on algorithms than from converting C code to assembly. Even GPU (graphics card processor) programming can be done in C, using CUDA or OpenCL.

There are still some (rare) cases when you should/have to use assembly, usually to get very fine control on the hardware. But even in OS kernel code it's usually very small parts and not that much code.

kriss
  • 1
    It's not just using *new* instructions that makes a difference. Tuning choices like whether / how much to unroll, which instructions to use (`loop` vs. `dec/jnz`, `sub`/`mov` vs. `push`) changed immensely between 8086 and 686. And 586 in-order superscalar pentium was an outlier where it could pipeline simple instructions, making it worth it to use more simpler instructions vs. fewer complex instructions. Later CPUs can decode complex ones to multiple uops, but 586 couldn't and would just stall the pipeline. – Peter Cordes Jul 02 '19 at 07:26
  • 1
    Also, tuning for 8086 = usually minimize code size because instruction fetch was *the* major bottleneck. Tuning for modern x86 = minimize uop count, and latency of dependency chains. Anyway yes, unless you need to tune the hell out of one hot loop for a limited set of CPU microarchitectures, you don't need hand-written asm. Compilers are pretty good, but certainly do have missed optimizations all over the place. But usually pretty minor, especially if you're running on modern x86 with wide pipelines to eat up wasted instructions so you still bottleneck mostly on memory. – Peter Cordes Jul 02 '19 at 07:30
4

There are very few reasons to use assembly language these days. Even low-level constructs like SSE and the older MMX have built-in intrinsics in both gcc and MSVC (icc too I bet, but I never used it).

Honestly, optimizers these days are so insanely aggressive that most people couldn't match even half their performance writing code in assembly. You can change how data is ordered in memory (for locality) or tell the compiler more about your code (through #pragma), but actually writing assembly code... doubt you'll get anything extra from it.
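
To illustrate the point about data ordering and locality, here is a minimal sketch (hypothetical function names, a row-major matrix stored in a flat array). Both functions compute the same sum; the first walks memory sequentially, the second jumps by a whole row on every access, which tends to be far less cache friendly on large matrices. An optimizer may or may not interchange the loops for you, so treat this as an illustration rather than a benchmark.

```c
#include <stddef.h>

/* Row-by-row traversal: consecutive accesses touch adjacent memory. */
long sum_row_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

/* Column-by-column traversal of the same row-major data: each access is
   `cols` elements away from the previous one. */
long sum_column_order(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```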

@VJo, note that using intrinsics in high level C code would let you do the same optimizations, without using a single assembly instruction.
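
For what the intrinsics route looks like, here is a minimal sketch, assuming an x86 target with SSE and the standard `<xmmintrin.h>` header; the function name `add_floats` and the `HAVE_SSE_SKETCH` macro are made up for the example. On targets without SSE the scalar loop handles everything.

```c
#include <stddef.h>

/* HAVE_SSE_SKETCH is a hypothetical macro for this example only. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE_SKETCH
    /* Four floats per iteration; unaligned loads and stores, so the
       arrays need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The compiler still does the register allocation and scheduling here, which is the advantage over writing the same instructions in inline assembly.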

And for what it's worth, there have been discussions about the next Microsoft C++ compiler, and how they'll drop inline assembly from it. That speaks volumes about the need for it.

Blindy
4

I don't think you specified the processor, and the answer differs depending on the processor and the environment. The general answer is yes, it is still done, and it is certainly not archaic. The general reason is the compilers: sometimes they do a good job of optimizing in general but not so well for specific targets. Some are really good at one target and not so good at others. Most of the time the result is good enough, and most of the time you want portable C code rather than non-portable assembler. But you will still find that C libraries hand optimize memcpy and other routines, because the compiler simply cannot figure out that there is a very fast way to implement them. In part that is because the corner case is not worth the time it would take to make the compiler optimize for it; it is simpler to solve it in assembler and let the build system carry a lot of "if this target then use C, if that target then use asm" rules. So it still occurs, and I argue it must continue forever in some areas.

x86 is its own beast with a lot of history. We are at a point where you really cannot, in any practical manner, write one blob of assembler that is always faster. You can definitely optimize routines for a specific processor on a specific machine on a specific day and outperform the compiler, but other than for some specific cases it is generally futile: educational, but overall not worth the time. Also note that the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough; find the performance elsewhere.

Other platforms often means embedded: arm, mips, avr, msp430, pic, etc. You may or may not be running an operating system, and you may or may not have a cache or the other such things that your desktop has, so the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them. Even C, considered perhaps to be a low level language, doesn't match the instruction set. There will always be times when you can produce segments of assembler that outperform the compiler. Not necessarily the segment that is your bottleneck, but across the entire program you can often make improvements here and there. You still have to check the value of doing that. In an embedded environment it can and does make the difference between success and failure of a product. Suppose your product has $25 per unit invested in more power hungry, higher speed processors that take more board real estate so that you don't have to use assembler, while your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, less power, cheaper parts, etc. Then, so long as the NRE is recovered, the mixed-with-asm solution will win in the long run.

True embedded is a specialized market with specialized engineers. The other embedded market (your embedded linux roku, tivo, etc., embedded phones, etc.) needs portable operating systems to survive, because you need third party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library, as mentioned, or in the operating system there may be some assembler optimizations, but as with the desktop you want to throw more hardware at the problem so the software can be portable instead of hand optimized. And your product line or embedded operating system will fail if assembler is required for third party success.

The biggest concern I have is that this knowledge is being lost at an alarming rate. Because nobody inspects the assembler and nobody writes in assembler, nobody is noticing that the compilers have not been improving when it comes to the code being produced. Developers often think they have to buy more hardware instead of realizing that, by either knowing the compiler or knowing how to program better, they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code (5-10% is usual with the same source code and compiler). gcc 4 does not always produce better code than gcc 3; I keep both around because sometimes gcc 3 does better. Target specific compilers can (but do not always) run circles around gcc; you can sometimes see a few hundred percent improvement with the same source code and a different compiler. Where does all of this come from? The folks that still bother to look at and/or use assembler. Some of those folks work on the compiler backends. The front end and middle are fun and educational certainly, but the backend is where you make or break the quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -S myprog.c), it will make you a better high level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler, then by definition we have given up on writing and maintaining compilers for high level languages, and software in general will cease to exist.

Understand that with gcc, for example, the output of the compiler is assembly that is passed to an assembler, which turns it into object code. The C compiler does not normally produce binaries. The objects are combined into the final binary by the linker, yet another program that is called by the compiler and is not part of the compiler. The compiler turns C or C++ or Ada or whatever into assembler, then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception, not the rule. LLVM has its own runtime solution, and if you use it as a cross compiler it quite visibly shows the path from high level code to internal code to target code to binary.
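
The stages described above can be run one at a time with gcc. A small sketch, where the file names `sum.c` and `other.o` and the function `sum_ints` are hypothetical; `-S` and `-c` are the standard gcc options for "stop after compiling to assembly" and "stop after producing an object file":

```c
/* sum.c (hypothetical file name)
 *
 *   gcc -O2 -S sum.c             compile only: writes assembly to sum.s
 *   gcc -c sum.s                 assemble only: writes object file sum.o
 *   gcc sum.o other.o -o prog    link: other.o is a hypothetical object
 *                                that provides main()
 *
 * Reading sum.s after the first step is the easy way to see what the
 * optimizer actually did with the loop below.
 */
int sum_ints(const int *values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```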

So back to the point: yes, it is done, more often than you think. Mostly that has to do with the language not mapping directly onto the instruction set, and with the compiler not always producing fast enough code. If you can get, say, dozens of times improvement on heavily used functions like malloc or memcpy, or you want an HD video player on your phone without hardware support, balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit; sometimes it is all C, but sometimes the software is completely coded in assembler. For desktop x86, the processor is not the bottleneck. The processors are microcoded. Even if you make assembler that looks beautiful on the surface, it won't run really fast on all families of x86 processors; sloppy, good enough code is more likely to run about the same across the board.

I highly recommend learning assembler for non-x86 ISAs like arm, thumb/thumb2, mips, msp430, avr. Targets that have compilers, particularly ones with gcc or llvm compiler support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high level code much better without assembler, faster and more reliable.

old_timer
  • 1
    Well, I wasn't looking for any specific processor. I wanted to understand this practice & the reasons why one would take this approach. Just updating my knowledge... – Srikar Appalaraju Nov 17 '10 at 20:10
3

It depends. It is (still) being done in some situations, but for the most part, it is not worth it. Modern CPUs are insanely complex, and it is equally complex to write efficient assembly code for them. So most of the time, the assembly you write by hand will end up slower than what the compiler can generate for you.

Assuming a decent compiler released within the last couple of years, you can usually tweak your C/C++ code to gain the same performance benefit as you would using assembly.

A lot of people in the comments and answers here are talking about the "N times speedup" they gained rewriting something in assembly, but that by itself doesn't mean too much. I got a 13 times speedup from rewriting a C function evaluating fluid dynamics equations in C, by applying many of the same optimizations as you would if you were to write it in assembly, by knowing the hardware, and by profiling. At the end, it got close enough to the theoretical peak performance of the CPU that there would be no point in rewriting it in assembly. Usually, it's not the language that's the limiting factor, but the actual code you've written. As long as you're not using "special" instructions that the compiler has difficulty with, it's hard to beat well-written C++ code.

Assembly isn't magically faster. It just takes the compiler out of the loop. That is often a bad thing, unless you really know what you're doing, since the compiler performs a lot of optimizations that are really really painful to do manually. But in rare cases, the compiler just doesn't understand your code, and can't generate efficient assembly for it, and then, it might be useful to write some assembly yourself. Other than driver development or the like (where you need to manipulate the hardware directly), the only place I can think of where writing assembly may be worth it is if you're stuck with a compiler that can't generate efficient SSE code from intrinsics (such as MSVC). Even there, I'd still start out using intrinsics in C++, and profile it and try to tweak it as much as possible, but because the compiler just isn't very good at this, it might eventually be worth it to rewrite that code in assembly.

jalf
2

Take a look here, where the guy improved performance 6 times using assembly code. So, the answer is: it is still being done, but the compiler is doing a pretty good job.

BЈовић
  • so you mean compiler is good enough but if compiler fails to optimize certain sections then use assembly? – Srikar Appalaraju Nov 17 '10 at 08:41
  • 1
    @VJo: Note that the article covers optimization of math-intensive routines via the processor's instruction set. In that specific case, writing assembly may be a benefit, but not in the general case. – In silico Nov 17 '10 at 08:43
  • 2
    @MovieYoda: No compiler will help against really dumb code - first profile the program and try to optimize it without assembler. – sharptooth Nov 17 '10 at 08:43
  • @MovieYoda: For some very special cases, one may be able to take advantage of the available hardware. Generally, however, writing inline assembly in C++ is not done often as the compiler does a good enough job of optimizing code (assuming non-WTF code), and the smarter ones may sometimes optimize better than by hand since optimizations can be very counterintuitive. – In silico Nov 17 '10 at 08:44
  • @sharptooth got it. Loving the links you all are sharing. – Srikar Appalaraju Nov 17 '10 at 08:48
  • Writing assembly code better than what the compiler produces is hard, and should be done only after lots of profiling on a big data set. – BЈовић Nov 17 '10 at 09:02
  • Just wanted to add that for this kind of mathematically intensive stuff, there are libraries like Intel MKL or AMD ACML which contain highly optimized functions which use assembly kernels. http://www.nag.co.uk/industryArticles/HighPerformanceMathLibraries.pdf – steabert Nov 17 '10 at 09:49
2
  1. "Is this practice still done?" --> It is done in image processing, signal processing, AI (e.g. efficient matrix multiplication), and other areas. I would bet the processing of the scroll gesture on my macbook trackpad is also partially assembly code, because it is immediate. --> It is even done in C# applications (see https://blogs.msdn.microsoft.com/winsdk/2015/02/09/c-and-fastcall-how-to-make-them-work-together-without-ccli-shellcode/)

  2. "Isn't writing in Assembly Language a bit too cumbersome & archaic?" --> It is a tool like a hammer or a screwdriver, and some tasks require a watchmaker's screwdriver.

  3. "When we compile C code (with or without -O3 flag), the compiler does some code optimization ... So how does inducing 'Assembly Language' help?" --> I like what @jalf said, that writing C code the way you would write assembly will already lead to efficient code. However, to do this you must think about how you would write the code in assembly language, e.g. understand all the places where data is copied (and feel pain each time it is unnecessary). With assembly language you can be sure which instructions are generated. Even if your C code is efficient, there is no guarantee that the resulting assembly will be efficient with every compiler. (see https://lucasmeijer.com/posts/cpp_unity/) --> With assembly language, when you distribute a binary, you can test for the CPU and take different branches depending on the CPU features, e.g. one optimized for AVX and one just for SSE, but you only need to distribute one binary. With intrinsics this is also possible in C++ or .NET Core 3. (see https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/)
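
The "test for the CPU and branch" idea in point 3 can be sketched in C as follows. This assumes GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins are available; every function name here is hypothetical, and the "AVX" variant is only a placeholder that a real program would replace with an intrinsics or assembly kernel.

```c
#include <stddef.h>

/* Portable baseline implementation. */
float sum_generic(const float *v, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Placeholder for an AVX-tuned variant (hypothetical): it just forwards
   to the generic code so that the sketch stays portable. */
float sum_avx_placeholder(const float *v, size_t n)
{
    return sum_generic(v, n);
}

typedef float (*sum_fn)(const float *, size_t);

/* Picks an implementation once, at run time, based on CPU features. */
sum_fn pick_sum_impl(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return sum_avx_placeholder;
#endif
    return sum_generic;
}
```

A caller would typically store the result of `pick_sum_impl()` in a function pointer at startup and call through it afterwards, so one binary serves both kinds of CPU.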
David
1

At work, I have used assembly on an embedded target (a microcontroller) for low level access.

But for PC software, I don't think it is very useful.

Benoît
  • I think games programmers are the only people who use ASM in programming nowadays. – graham.reeds Nov 17 '10 at 08:40
  • aaah! now I remember where I read about this practice. It was written about the game "Need for Speed" using assembly. Naturally I was stunned – Srikar Appalaraju Nov 17 '10 at 08:52
  • @graham.reeds: it was true a few years back, but with GPU layers like CUDA I'm not sure it's still true for game programmers. There are still some small spots for kernel or driver programming, or some embedded devices. – kriss Nov 17 '10 at 09:08
  • @Kriss: Assembler will ALWAYS be used in games development. Regardless of that using assembler is incredibly useful on any platform including PCs. I had some audio convolution code that I re-wrote in assembler and got a 5 times speed up of the convolution over straight C. – Goz Nov 17 '10 at 09:11
  • FMI, do you manage multiple targets ? or using standard x86 instructions (not SSE or others) ? – Benoît Nov 17 '10 at 09:22
  • 2
    @Goz: I'm old enough to be quite sceptical when I hear *always*. It may still be useful to use assembler for a while, but you should not bet on the future. For now, even in games, very few people work on game engines (where assembly is useful), and the same engines are used in many games. When optimization becomes hard enough, you get the `Duke Nukem Forever` effect: you haven't yet finished optimizing when the next hardware generation arrives, and you have to restart from scratch because everything changed and your old optimized code is now less efficient than compiled code on new hardware... – kriss Nov 17 '10 at 11:15
  • @Goz: and also 5 times is not much. In the past say 15 years ago, I used to optimize low level game code to assembly. And I rarely got code less than 10 times faster, quite often 50 times faster. That may show how compilers evolved from that time. Also being aware of all cache, behavior, reordering, instruction fetching effects, specialized instruction set, etc. is not easy. – kriss Nov 17 '10 at 11:28
  • @Kriss: Perhaps always is a bit much but if I ever come across a vectorising compiler that can vectorise better than I can, I'll eat my hat (The chocolate one ;)). – Goz Nov 17 '10 at 13:31
1

I have an example of assembly optimization I've done, but again it's on an embedded target. You can see some examples of assembly programming for PCs too, and it creates really small and fast programs, but usually not worth the effort (Look for "assembly for windows", you can find some very small and pretty programs).

My example was when I was writing a printer controller, and there was a function that was supposed to be called every 50 microseconds. It had to do reshuffling of bits, more or less. Using C I was able to do it in about 35 microseconds, and with assembly I did it in about 8 microseconds. It's a very specific procedure but still, something real and necessary.

SurDin
1

On some embedded devices (phones and PDAs), it's useful because the compilers are not terribly mature, and can generate extremely slow and even incorrect code. I have personally had to work around, or write assembly code to fix, the buggy output of several different compilers for ARM-based embedded platforms.

Graham Borland
0
  1. Yes. Use either inline assembly or link assembly object modules. Which method you should use depends on how much assembly code you need to write. Usually it's OK to use inline assembly for a couple of lines and to switch to separate object modules once it's more than one function.
  2. Definitely, but sometimes it's necessary. The prominent example here would be programming an operating system.
  3. Most compilers today optimize the code you write in a high-level language much better than anyone could ever write assembly code. People mostly use it to write code that would otherwise be impossible to write in a high-level language like C. If someone uses it for anything else, it means he is either better at optimization than a modern compiler (I doubt that) or just plain stupid, e.g. he doesn't know what compiler flags or function attributes to use.
flacs
  • If someone actually did write assembly code with the intention of optimizing a certain code fragment for speed, he would have to know what CPU the code will be running on and how this one particular CPU works internally. Most modern CPUs are able to execute multiple instructions simultaneously (in one core) by analyzing which instructions don't depend on the result of others plus a whole lot of other means of speeding up program execution. – flacs Nov 17 '10 at 09:12
  • And there are people (most likely not the ones who ask this kind of question) who know the internals of CPUs, write different code paths for different kinds of CPUs and are actually able to produce faster code than any compiler. See http://www.agner.org/optimize/ for some interesting stuff. So replacing "anyone" in (3) of your answer by "most people" would be more correct. – Gunther Piez Nov 17 '10 at 10:19
  • I think you overestimate the compiler's ability to understand the program. Yes, the compiler would know how to shuffle instructions to optimize the use of the pipeline. But it knows very little about which variables/functions depend on each other. Because of this, I and others in my team have been able to write assembler code that outperforms the compiler. – daramarak Nov 17 '10 at 11:04
-1

use this:

__asm__ __volatile__(/*assembly code goes here*/);

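The __asm__ can also be written as just asm.

The __volatile__ qualifier tells the compiler not to delete the statement or hoist it out of loops even when its outputs appear unused. It does not switch off optimization of the surrounding code.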

Kobus Myburgh
  • Welcome to SO! Please read the tour [tour] and [answer] a question. The question was asked 10 years ago and has accepted answers. – GoodJuJu Dec 16 '20 at 14:18
  • GNU C Basic asm (with just a string of instructions, no input/output/clobbers constraints) is obsolete and dangerous, and can't safely be used for much of anything. See https://gcc.gnu.org/wiki/ConvertBasicAsmToExtended. e.g. referencing global variables by symbol name is not safe, neither is modifying any registers, and (in x86-64) neither is using any stack space (unless you skip the red zone). Never use it inside a function. See https://stackoverflow.com/tags/inline-assembly/info for guides to GNU C Extended asm. (e.g. `asm ("add %1, %0" : "+r"(var) : "r"(var))`) – Peter Cordes Dec 16 '20 at 19:34