
I am profiling my C++ code targeting a RISC architecture. I have two binaries: one compiled for x86 and the other for RISC-V. I have profiled both using perf and gprof. According to the theory of RISC vs. CISC architectures I expected the RISC binary to perform better, but the perf results contradict this. Could someone tell me what's wrong here?

Result of Perf:

Performance counter stats for './unit_tests' (x86, CISC):

    180,899022      task-clock (msec)         #    0,885 CPUs utilized          
             7      context-switches          #    0,039 K/sec                  
             2      cpu-migrations            #    0,011 K/sec                  
         1.350      page-faults               #    0,007 M/sec                  
   588.853.057      cycles                    #    3,255 GHz                    
   863.377.707      instructions              #    1,47  insn per cycle        
   157.440.034      branches                  #  870,320 M/sec                  
       992.067      branch-misses             #    0,63% of all branches        

   0,204509183 seconds time elapsed

Performance counter stats for './unit_tests' (RISC-V):

    693,264322      task-clock (msec)         #    0,999 CPUs utilized          
            28      context-switches          #    0,040 K/sec                  
             1      cpu-migrations            #    0,001 K/sec                  
         2.400      page-faults               #    0,003 M/sec                  
 2.320.185.432      cycles                    #    3,347 GHz                    
 5.467.630.410      instructions              #    2,36  insn per cycle        
   960.171.812      branches                  # 1385,001 M/sec                  
     7.038.808      branch-misses             #    0,73% of all branches        

   0,693978844 seconds time elapsed

As seen from the above results, the elapsed time for RISC is greater than for CISC, and the insn per cycle is also higher for RISC. I am wondering why that is. Can someone tell me if I am missing something or interpreting the results wrong?

  • Do you have transparent `qemu` set up so you can run RISC-V binaries on your x86 system? (Probably, given [your previous question](https://stackoverflow.com/questions/63399539/how-to-run-elf-64-bit-lsb-executable-ucb-risc-v-version-1-gnu-linux-dynamic) about that). If so, you're profiling `qemu` interpreting / emulating RISC-V, not the RISC-V "guest" code itself. Or are you running Linux on a RISC-V system that just happens to have a similar clock speed to your x86? – Peter Cordes Aug 15 '20 at 17:23
  • Also, IDK what you think "theory" predicts. More total instructions balanced by higher IPC is not unreasonable when comparing equally aggressive superscalar out-of-order RISC vs. CISC in general (http://www.lighterra.com/papers/modernmicroprocessors/). Of course, not *that many* more instructions, like a factor of 6 more in this case. That's because of qemu dynamic translation and interpreting overhead. The actual RISC-V instruction count (of the guest code inside qemu, or running on a real RISC-V) might be 1.5x at worst, probably less bad than that. – Peter Cordes Aug 15 '20 at 17:29
  • @PeterCordes Thanks for the information. My Linux machine has an Intel Core i3 CPU with a 3.40GHz clock speed, but my target board, an Artix-7 FPGA, has an internal clock speed exceeding 450MHz. I am running both executables on my Linux machine, as I don't have access to the hardware (FPGA); as mentioned, I am emulating the binary using qemu. How can I reconcile the performance results I obtained with the following theory? – Yulia Aug 15 '20 at 18:09
  • For example: the performance of RISC processors is often said to be two to four times that of CISC processors because of the simplified instruction set. But as we see above, CISC took 0,204509183 sec while RISC took 0,693978844 sec. This is what I am wondering about. Correct me if I am wrong. – Yulia Aug 15 '20 at 18:09
  • Well first of all, like I said you're not comparing CPUs, you're comparing native code vs. an emulator on the same CPU. But also, modern x86-64 CPUs are the exception to the rule about CISC being slow. They spend enough transistors on decoding to uops internally (and caching decoded uops) to have single-thread performance better than any current commercial RISC design, when you consider the clock frequency they can hit. (Well maybe IBM POWER can compare, IDK; it's a similarly high-end wide and deep out-of-order design.) Look at https://www.spec.org/cpu2017/results/ for multi-core throughput. – Peter Cordes Aug 15 '20 at 18:17
  • Also, go read [Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/) which I linked earlier - it has a whole section about how modern x86 CPUs manage to run fast despite their hard-to-decode variable-length instruction set. For example, as you can see, your CPU managed to achieve 2.36 IPC at 3.4GHz when running a RISC-V emulator. For more low-level details on x86 microarchitectures, see https://www.realworldtech.com/sandy-bridge/ and https://agner.org/optimize/. – Peter Cordes Aug 15 '20 at 18:20
  • @PeterCordes As I am using perf on qemu, I am comparing an emulation to native execution; won't the emulation be much slower? Running perf inside a simulation can sometimes give meaningful results, but is what I have done here a reasonable comparison? – Yulia Aug 17 '20 at 20:56
  • That would maybe be true if you were using `qemu-system` to run a Linux kernel and `perf` inside the guest. But qemu doesn't even try to do cycle-accurate simulation of a real pipeline; it just runs as fast as the host can run the emulator. So there is no such thing as "guest cycles". But you're not even doing that; it looks like you're effectively running qemu-user like `perf stat qemu-riscv ./my_riscv_binary`. (Because that's what transparent binfmt_misc is equivalent to). Try running `perf stat qemu-riscv ./unit_tests` and see if the result is the same. – Peter Cordes Aug 17 '20 at 21:02
  • I am using qemu in user-mode not in the system mode. – Yulia Aug 17 '20 at 21:06
  • I am just running qemu as `perf stat ./unit_tests`. @PeterCordes – Yulia Aug 17 '20 at 21:09
  • Please read my entire comment. If you run `perf stat qemu-riscv ./unit_tests`, you'll probably see the same results as `perf stat ./unit_tests`. That will confirm that the first way is still measuring the emulator itself, not just the guest code running inside qemu. binfmt_misc just makes `./unit_tests` equivalent to running `qemu-riscv ./unit-tests`; it doesn't have special interaction with perf. – Peter Cordes Aug 17 '20 at 21:14
  • @PeterCordes Even if I run it as mentioned above, there is no difference in the results. – Yulia Aug 18 '20 at 19:08
  • Performance counter stats for 'qemu-riscv64-static ./unit_tests':

        ```
            616,035295      task-clock (msec)         #    0,999 CPUs utilized
                    20      context-switches          #    0,032 K/sec
                     2      cpu-migrations            #    0,003 K/sec
                 2.383      page-faults               #    0,004 M/sec
         2.082.256.523      cycles                    #    3,380 GHz
         5.459.662.517      instructions              #    2,62  insn per cycle
           958.707.553      branches                  # 1556,254 M/sec
             5.120.574      branch-misses             #    0,53% of all branches

           0,616719850 seconds time elapsed
        ```

    And `file unit_tests` reports: ELF 64-bit LSB executable, UCB RISC-V, version 1 (GNU/Linux), statically linked, for GNU/Linux 4.15.0, with debug_info, not stripped – Yulia Aug 18 '20 at 19:10

1 Answer


You're profiling qemu interpreting / emulating RISC-V, not the RISC-V "guest" code inside QEMU. QEMU can't report guest cycles: it's not a cycle-accurate simulator of anything, so there's nothing there for perf to measure except the emulator itself.

Emulation is slower and executes far more host instructions than native code compiled for your x86-64 in the first place.

Using binfmt_misc to transparently run qemu-riscv64 on RISC-V binaries makes `./unit_tests` exactly equivalent to `qemu-riscv64 ./unit_tests`.

Your test results prove this: `perf stat qemu-riscv64 ./unit_tests` gave you approximately the same results as what's in your question.


Somewhat related: [Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/) has some good details about how CPU pipelines work. RISC isn't always faster than modern x86 CPUs: they spend enough transistors on decoding that they run x86-64 code fast anyway.

You actually would expect more total instructions for the same work from a RISC CPU, just not that many more instructions: more like 1.1x or 1.25x, not 6x.

Performance depends on the microarchitecture, not (just) the instruction set. IPC and total time or cycles depend entirely on how aggressive the microarchitecture is at finding instruction-level parallelism. Modern Intel designs are some of the best at that, even in fairly dense CISC x86 code with memory-source instructions being common.

  • From the above discussion, what I understood is: the longer execution time in the RISC perf stats is not because of the RISC-V emulation running on the host machine, and the notion that emulation is much slower than running on real hardware is wrong. The execution time has nothing to do with the host machine or the emulation time; it is related to the cycle count, which is ISA-relevant. I hope I am right with this conclusion; please correct me if I am wrong. – Yulia Aug 18 '20 at 19:49
  • @Yulia: no, the extra execution time is approximately all emulation time and depends on how QEMU is written. It tells you *nothing* about how fast your RISC-V code would run on any real RISC-V, whether it's a simple in-order pipeline or a superscalar out-of-order RISC-V. – Peter Cordes Aug 18 '20 at 19:54