5

What is the most unorthodox way to improve the performance of C code? This is no-holds-barred! Everything goes including changing loop structures to gotos, hardcoding anything and everything, using case statements in bizarre ways etc. Don't worry at all about maintainability, readability etc.

p.s. This is practical... and I'm well aware of how to improve the performance of code in reasonable ways (improve algorithms, profile before you optimize etc)

MrDatabase
  • There is no proof that going against the language and what compilers are "optimized" for is gonna give you a performance boost. – AraK Nov 24 '09 at 02:54
  • Since when is improving algorithms, profiling before you optimize, etc. reasonable? If that were true, we wouldn't have to work so hard to convince people to do these things. – jason Nov 24 '09 at 02:58
  • I voted to re-open. I would have liked to add an answer, namely this link: http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773 – Mike Dunlavey Nov 24 '09 at 15:16

15 Answers

21

In my experience the most unorthodox way of optimizing C code is to profile the application, identify slow-performing structures and/or db hits, and then design reasonable solutions around them using Big O analysis.

JeffreyABecker
  • This doesn't really deserve +6, since it's not at all unorthodox and goes against the question....... but it is reasonable so I can't downvote you either :\ – mpen Nov 24 '09 at 03:03
  • @Mark: it's a joke, saying that the perfectly logical way to optimize is "unorthodox" because few people actually do it that way. – nobody Nov 24 '09 at 03:07
  • more to the point, people tend to look at you like you have an irregular number of heads if you suggest it. – JeffreyABecker Nov 24 '09 at 03:10
  • straightforward solution - simply statistically strange – Andy Dent Nov 24 '09 at 04:18
6

Duff's Device is the canonical example. It's so weird that Tom Duff admitted, "This code forms some sort of argument in [the debate about fall-through in case statements], but I'm not sure whether it's for or against".

Grandpa
5

Abusing the constant 0x5f3759df to compute inverse square roots quickly has to rank pretty high...

Grandpa
4

Profile your code, find the slow spots, and use inline assembly to optimize them.

Robert Harvey
  • When I worked at a game company, we did that. But eventually, you hit diminishing returns and you have to look at the big picture. We often found that rearranging the layout of data structures affected overall performance to a large degree. – Nosredna Nov 24 '09 at 03:05
  • You forgot step 4: profile again to be sure your inline assembly didn't actually slow the code. I've seen that happen. – Dour High Arch Nov 27 '09 at 23:46
  • @Nosredna - I am really interested by your comment.... I am about to ask a question about this – tom Feb 18 '21 at 21:52
4

1) Loop unrolling. You save a jump, comparison, and increment every iteration if you don't actually loop.
2) Avoid double indirection. It's usually faster to perform arithmetic than retrieval, so a[y*height + x] is usually faster than a[y][x]. Plus, a one-dimensional array of size MxN saves M (or N) words' worth of pointers compared to a rectangular matrix of dimensions MxN.
3) Use ridiculous assembly optimizations whenever possible. For instance, on the x86 architecture, you can use the BSWAP instruction to swap bytes in one operation instead of the normal temp=a; a=b; b=temp; pattern.

And of course, don't forget:
4) Don't do bounds checking or error handling.

That having been said, I'd avoid all of these except (2) in practice.

Dathan
  • Except that most of this is useless because the compiler will do it. – Zan Lynx Nov 24 '09 at 03:03
  • In most cases "unorthodox optimizations" are pointless - pointing out that answers to a pointless question are themselves pointless is kinda... pointless. (c: – Dathan Nov 24 '09 at 03:05
  • Can't compilers automatically do 1 and 2? And shouldn't there be a library full of assembly hacks for these sorts of things? – mpen Nov 24 '09 at 03:05
  • Darn. Thought I was being clever - you actually said "useless." Doh! – Dathan Nov 24 '09 at 03:06
  • I haven't looked at the machine code output of a C compiler lately, but since 1 and 3 were covered in my undergrad compilers course, I assume they're pretty widely implemented. – Dathan Nov 24 '09 at 03:07
4

Use inline assembly?

Seriously though, if by merely changing the C code you can improve performance, chances are you can do it cleanly.

A few exceptions:

1) If you are relying on alignment semantics for pointers of different types, you can often perform block operations on pointers that technically expose your application to a bounds-overrun condition, but in practice do not because of your system's alignment characteristics. So a memory copy can be performed by aligning the initial chars, after which the inner block can be done through a long* pointer.

2) It may be possible to copy stack frames in clever ways if you know the memory order in which your compiler assigns local variables. This may allow you to implement co-routines which the language doesn't otherwise support. Coroutines are often a simpler and faster way of implementing some kinds of loop control.

3) Unions are always a bit "hacky" however you use them. They're a way of implementing polymorphism with fairly loose type checking.

4) Use of the C preprocessor as a way of auto-generating code is usually very difficult to debug and read. As such people tend to avoid this.

Paul Hsieh
3

You're looking for an unorthodox, no-holds-barred, yet general-purpose solution to optimizing C?

Rewrite it in assembly language.

RickNZ
3

For point 3 in Dathan's answer above, another way of swapping: you can swap variables in an unconventional manner using XOR.

int x = 3, y = 4;
x = x ^ y; 
y = y ^ x; 
x = x ^ y; 

Now x and y are swapped! :)

Another thing: when you are dividing something by 2, it is better to use the shift-right operator. The same goes for multiplying by 2: shift left.

In the old Borland C compiler, there was a _stklen property that you could assign in order to reduce stack size and code size. I haven't seen anything like that recently, as compiler technology has advanced since.

Instead of malloc, it can be better to use calloc, as it initializes the memory to zero.

Using the ternary operator instead of an if/else statement was apparently faster, though I suspect compiler writers have gotten smarter about machine-code generation since. I can't provide proof, but it held true back when Borland C 3.01 ruled the roost.

Inlining code with assembly routines.

I like this question topic as it reminds me of the old days, when memory was precious and you had to squeeze a quart into a pint pot using the hocus-pocus tricks of x86 code. Thanks for posting this question, MrDatabase.

t0mm13b
  • I should also mention that when dealing with arrays, it is faster to do so using *(some_array + n) when you have declared char some_array[50]...but that could be irrelevant now with regards to compiler technology today... ;) – t0mm13b Nov 28 '09 at 01:55
2

I hear a lot of answers of the form "Try doing X, Y, or Z", but that's like saying "Here, have a fish, and eat well for a day".

I would rather teach you how to fish - for performance problems. People who say "Profile First" are on the right track but (IMHO) are far too timid.

Here's an example of aggressive performance tuning.

Here's a short explanation of why it works.

Here's a long explanation of why it works.

That will teach you to fish by showing you how to find out where the fish are and how big they are. Once you find them, you can cook them up (fix them) in many wonderful ways. The great thing is, once you find and dispose of one fish (performance problem), the others get bigger and easier to catch.

Mike Dunlavey
1

Duff's Device & Carmack's Fast InvSqrt.

jason
1

Your compiler is almost certainly better at optimizing than your ugly hand-tweaks will be. Most of the historical little tricks are now pointless. People who ignore readability and maintainability tend to write code that ends up less efficient, because the real optimizations are made more difficult.

When code has been optimized in all ways possible and still needs performance gain, rewriting critical portions in ASM is the best hope to have any effect.

Mike Graham
1

In DSP applications, it's still worth going to assembly language to get the best performance out of the SIMD instructions that C compilers don't do very well with. But that's not really a "C" solution.

Something I do pretty often is use curve-fitting software to replace functions with approximations that are faster to calculate. Sometimes LUTs are still faster than doing a bunch of calculations, but not as often as they used to be.

Nosredna
1

See this chapter, It’s a plain Wonderful Life by Abrash (it's about 5 pages: click 'Next' at the bottom of each screen).

Summary (some quotes from the article):

  • Table-driven magic (huge lookup table and incredible state machine)
  • An approach to performance programming that operates at a more efficient, tightly integrated level than you may ever see again
  • Astonishing economy of effort
ChrisW
1

There is nothing unorthodox left to do for C code performance. All of the effective techniques have been "orthodoxed".

The best I've found is to use a profiler with access to CPU performance counters and pay special attention to cache and branch misses. Add cache prefetches wherever you can and remove unpredictable branches wherever you can.

Don't bother with loop unrolling. If the branch is predictable it is almost free. Let the compiler worry about it.

On some very parallel architectures like IA64 it can be faster to unroll a loop all the way to the end. One example of this is avoiding the C string functions. Use memset to zero a string array, memcpy to set the string and memcmp to compare the entire array against another similar array. This can use 64-bit loads, never has to check for the zero terminator and can be optimized to not loop or branch at all if using a "small" array size of 64 or 128. The memxxx() functions are usually compiler built-ins and very optimized.

Zan Lynx
1

Inline Assembly.

s1n