Questions tagged [branch-prediction]

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

Why is it faster to process a sorted array than an unsorted array? Stack Overflow's highest-voted question and answer is a good introduction to the subject.

Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code which follows immediately after the conditional jump - or it can be "taken" and jump to a different place in program memory where the second branch of code is stored.

It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.

The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.

Source: http://en.wikipedia.org/wiki/Branch_predictor

The Spectre security vulnerability revolves around branch prediction:

Other resources

Special-purpose predictors: Return Address Stack for call/ret. ret is effectively an indirect branch, setting program-counter = return address. This would be hard to predict on its own, but calls are normally made with a special instruction so modern CPUs can match call/ret pairs with an internal stack.

Computer architecture details about branch prediction / speculative execution, and its effects on pipelined CPUs

Why is it faster to process a sorted array than an unsorted array?
Branch prediction - Dan Luu's article on branch prediction, adapted from a talk. With diagrams. Good introduction to why it's needed, and some basic implementations used in early CPUs, building up to more complicated predictors. And at the end, a link to TAGE branch predictors used on modern Intel CPUs. (Too complicated for that article to explain, though!)
Slow jmp-instruction - even unconditional direct jumps (like x86's jmp) need to be predicted, to avoid stalls in the very first stage of the pipeline: fetching blocks of machine code from I-cache. After fetching one block, you need to know which block to fetch next, before (or at best in parallel with) decoding the block you just fetched. A large sequence of jmp next_instruction will overwhelm branch prediction and expose the cost of misprediction in this part of the pipeline. (Many high-end modern CPUs have a queue after fetch before decode, to hide bubbles, so some blocks of non-branchy code can allow the queue to refill.)
Branch target prediction in conjunction with branch prediction?
What branch misprediction does the Branch Target Buffer detect?

Cost of a branch miss

Characterizing the Branch Misprediction Penalty is a nice paper that models the real effect on a superscalar Out-of-Order pipeline of branch misses as well as data cache misses and I-cache misses. And how long it takes the CPU to return to a steady-state of lots of instructions in flight hiding execution latencies. Ramping back up to that state costs time, too, which is why true mispredict cost is context-sensitive, on top of how many cycles before starting execution on the correct path.
What exactly happens when a skylake CPU mispredicts a branch? and Does a branch misprediction flush the entire pipeline, even for very short if-statement body? - fast recovery using a Branch Order Buffer snapshot of the out-of-order state can start sooner and allow work on instructions before the branch to continue, vs. treating it like other exceptions and flushing back to the retirement state.
Avoid stalling pipeline by calculating conditional early - detecting a mispredict early can let fast-recovery throw away less work, e.g. unrolling a loop can let the loop-overhead work run ahead of the rest of the loop, and only lose some front-end throughput on the hard-to-predict loop-exit branch, not much useful back-end work.
Will Speculative Execution Follow Into an Expensive Operation? - cost of "expensive" operations in the shadow of a mispredict.

Modern TAGE predictors (in Intel CPUs for example) can "learn" amazingly long patterns, because they index based on past branch history. (So the same branch can get different predictions depending on the path leading up to it. A single branch can have its prediction data scattered over many bits in the branch predictor table). This goes a long way to solving the problem of indirect branches in an interpreter almost always mispredicting (X86 prefetching optimizations: "computed goto" threaded code and Branch prediction and the performance of interpreters — Don't trust folklore), or for example a binary search on the same data with the same input can be really efficient.

Static branch prediction on newer Intel processors - according to experimental evidence, it appears Nehalem and earlier do sometimes use static prediction at some point in the pipeline (backwards branches default to predicted-taken, forward to not-taken.) But Sandybridge and newer seem to be always dynamic based on some history, whether it's from this branch or one that aliases it. Why did Intel change the static branch prediction mechanism over these years?

Cases where TAGE does "amazingly" well

Understanding branch prediction efficiency
First method call takes 10 times longer than consecutive calls with the same data (branch prediction can "learn" the branching for repeated sorting of the same data. Usually not useful in real programs (where you'd save the sorted data instead of re-sorting), just a problem to work around for microbenchmarking)
discussion in comments on Indexed branch overhead on X86 64 bit mode has more about why TAGE does well there.

Assembly code layout: not so much for branch prediction, but because not-taken branches are easier on the front-end than taken branches. Better I-cache code density if the fast-path is just a straight line, and taken branches mean the part of a fetch block after the branch isn't useful.

Superscalar CPUs fetch code in blocks, e.g. aligned 16 byte blocks, containing multiple instructions. In non-branching code, including not-taken conditional branches, all of those bytes are useful instruction bytes.

What is the effect of ordering if...else if statements by probability? (compilers guess which way if conditions will go, and lay out branches to minimize branches, especially taken branches, on the expected fast path.) Or profile-guided optimization has real data instead of guesses based on heuristics.

Or GNU C __builtin_expect / likely() / unlikely() lets programmers supply branch-direction hints which might be helpful if they're correct, but they can easily get out of date as code is modified.
How does the likely/unlikely macros in the Linux kernel works and what is their benefit? (the top answer that claims it helps branch prediction is wrong, some others have real examples of asm code layout differences.)
Are BOOST_LIKELY and __builtin_expect still relevant?
Branch-aware programming
Is there a compiler hint for GCC to force branch prediction to always go a certain way?
Is it possible to tell the branch predictor how likely it is to follow the branch? (Pentium 4 had branch hints as prefixes on instructions, silently ignored by earlier and later CPUs. PowerPC also experimented with that. Other than that, no, likely/unlikely affect code layout and/or other optimizations to make the expected case faster.

Branchless code: using `cmov` or other tricks to avoid branches

This is the asm equivalent of replacing if (c) a=b; with a = c ? b : a;. If b doesn't have side-effects, and a isn't a potentially-shared memory location, compilers can do "if-conversion" to do the conditional with a data dependency on c instead of a control dependency.

(C compilers can't introduce a non-atomic read/write: that could step on another thread's modification of the variable. Writing your code as always rewriting a value tells compilers that it's safe, which sometimes enables auto-vectorization: AVX-512 and Branching)

Potential downside to cmov in scalar code: the data dependency can become part of a loop-carried dependency chain and become a bottleneck, while branch prediction + speculative execution hide the latency of control dependencies. The branchless data dependency isn't predicted or speculated, which makes it good for unpredictable cases, but potentially bad otherwise.

https://yarchive.net/comp/linux/cmov.html - Linus Torvalds explains why cmov often sucks: most branches are predictable, and a branch takes even calculation of the branch condition off the critical path.
gcc optimization flag -O3 makes code slower than -O2 - Analysis of a real case where GCC's if-conversion heuristic isn't expecting sorted data, and a longer loop-carried dep chain with cmov is creates a bottleneck.

310 questions

25552

votes

26 answers

Why is processing a sorted array faster than processing an unsorted array?

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster: #include #include #include int main() { //…

asked Jun 27 '12 at 13:51

GManNickG

459,504
50
465
534

423

votes

5 answers

Why is (a*b != 0) faster than (a != 0 && b != 0) in Java?

I'm writing some code in Java where, at some point, the flow of the program is determined by whether two int variables, "a" and "b", are non-zero (note: a and b are never negative, and never within integer overflow range). I can evaluate it with if…

java performance processing-efficiency microbenchmark branch-prediction

asked Feb 21 '16 at 01:51

Maljam

6,048
3
14
30

191

votes

10 answers

What is the effect of ordering if...else if statements by probability?

Specifically, if I have a series of if...else if statements, and I somehow know beforehand the relative probability that each statement will evaluate to true, how much difference in execution time does it make to sort them in order of probability?…

c++ performance if-statement optimization branch-prediction

asked Oct 19 '17 at 15:17

Carlton

4,010
2
19
35

172

votes

4 answers

An expensive jump with GCC 5.4.0

I had a function which looked like this (showing only the important part): double CompareShifted(const std::vector& l, const std::vector &curr, int shift, int shiftY) { ... for(std::size_t i=std::max(0,-shift);i

c++ gcc x86 conditional-statements branch-prediction

asked Dec 06 '16 at 09:22

Jakub Jůza

1,113
1
7
13

151

votes

1 answer

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes. So, I have AMD Ryzen 9 5950X, clang++ 10 and Linux, I copy-pasted code from the question and here is what I got: Sorted - 0.549702s: ~/d/so_sorting_faster$…

c++ performance clang cpu-architecture branch-prediction

asked Mar 07 '21 at 20:57

DimanNe

1,418
2
10
14

118

votes

8 answers

Is there a compiler hint for GCC to force branch prediction to always go a certain way?

For the Intel architectures, is there a way to instruct the GCC compiler to generate code that always forces branch prediction a particular way in my code? Does the Intel hardware even support this? What about other compilers or hardwares? I would…

c++ gcc intel pragma branch-prediction

asked May 08 '15 at 18:54

WilliamKF

36,283
61
170
271

109

votes

16 answers

Is "IF" expensive?

I can't, for the life of me, remember what exactly our teacher said that day and I'm hoping you would probably know. The module is "Data Structures and Algorithms" and he told us something along the lines of: The if statement is the most expensive …

if-statement language-agnostic branch-prediction

asked Nov 24 '08 at 20:18

pek

16,939
27
83
98

votes

5 answers

Why is a conditional move not vulnerable for Branch Prediction Failure?

After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable for Branch Prediction Failure. I found on an article on cond moves here (PDF by AMD). Also there, they claim…

performance assembly optimization cpu-architecture branch-prediction

asked Jan 02 '13 at 23:52

Martijn Courteaux

63,780
43
187
279

votes

7 answers

In Java, can & be faster than &&?

In this code: if (value >= x && value <= y) { when value >= x and value <= y are as likely true as false with no particular pattern, would using the & operator be faster than using &&? Specifically, I am thinking about how && lazily evaluates the…

java performance processing-efficiency microbenchmark branch-prediction

asked Sep 20 '16 at 07:34

SusanW

1,493
1
10
21

votes

3 answers

What does `rep ret` mean?

I was testing some code on Visual Studio 2008 and noticed security_cookie. I can understand the point of it, but I don't understand what the purpose of this instruction is. rep ret /* REP to avoid AMD branch prediction penalty */ Of course I…

assembly x86 micro-optimization branch-prediction

asked Dec 11 '13 at 17:48

Devolus

20,356
11
56
104

votes

5 answers

When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams. However the performance with Streams is always turning out to be worse than traditional loops. int totalSize = 32768; int filterValue = 1280; int[] array =…

java performance java-8 java-stream branch-prediction

asked Dec 22 '16 at 08:26

Kishore Bandi

4,933
1
25
41

votes

6 answers

Portable branch prediction hints

Is there any portable way of doing branch prediction hints? Consider the following example: if (unlikely_condition) { /* ..A.. */ } else { /* ..B.. */ } Is this any different than doing: if (!unlikely_condition) { /* ..B.. */ …

c++ c optimization c++20 branch-prediction

asked Sep 13 '10 at 17:35

Oskar N.

7,917
2
20
21

votes

8 answers

Branch-aware programming

I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveil the problem and state that programmers usually can predict where a branch could go…

c++ c performance optimization branch-prediction

asked Sep 15 '15 at 08:48

Paolo M

11,431
5
51
69

votes

1 answer

How has CPU architecture evolution affected virtual function call performance?

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz. It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction…

x86 cpu cpu-architecture virtual-functions branch-prediction

asked Aug 30 '11 at 10:38

spraff

29,265
19
105
197

votes

2 answers

Branchless internal merge slower than internal merge with branch

I recently asked a question on Code Review to review a sorting algorithm named QuickMergeSort. I won't get in the details, but at some point the algorithm performs an internal mergesort: instead of using additional memory to store the data to merge,…

c++ performance sorting branch-prediction

asked Dec 13 '16 at 19:53

Morwenn

19,202
10
89
142

2 3

…

20 21 Next

Questions tagged [branch-prediction]

The Spectre security vulnerability revolves around branch prediction:

Other resources

Branchless code: using cmov or other tricks to avoid branches

Branchless code: using `cmov` or other tricks to avoid branches