The Goal
I'm currently trying out avr-llvm (an llvm that supports AVR as a target). My main goal is to use its hopefully better optimizer (compared to the one in gcc) to get smaller binaries. If you know a little about AVRs, you know that you have very little memory.
I currently work with an ATtiny45: 4 KB of flash and 256 bytes (just bytes, not KB!) of SRAM.
The Problem
I was trying to compile a simple C program (see below) to check what assembly code is produced and how the machine-code size develops. I used "clang -Oz -S test.c" to produce assembly output optimized for minimal size. My problem is that register values are saved needlessly, even though the compiler knows that this function never returns.
My Questions...
How can I tell llvm that it can just clobber any register if needed, without saving/restoring its contents? Any ideas how to optimize it even more (e.g. a more efficient setup of the stack)?
Details / Example
Here is my test program. As mentioned above it was compiled using "clang -Oz -S test.c".
#include <stdint.h>
void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}
As you can see, it has just one "volatile" variable of type uint8_t (if I don't declare it volatile, everything is optimized out). This variable is set to 1, and there is an endless loop at the end. Now let us have a look at the assembly output:
.file "test.c"
.text
.globl main
.align 2
.type main,@function
main:
push r28
push r29
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
.tmp0:
.size main, .tmp0-main
Yeah! That's a lot of machine code for such a simple program. I just tested some variations and had a look into the reference manual of the AVR... so I can explain what happens. Let's have a look at each part.
This is the "beef", which does what our C program is actually about. It loads r24 with the value 1, which is then stored to memory at Y+1 (stack pointer + 1). And of course there is our endless loop:
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
Note that the endless loop is needed. Otherwise the __attribute__ ((noreturn))
is ignored, and the stack pointer and the saved registers are restored at the end.
Just before that the pointer in "Y" is set up:
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
What happens here is:
- Y (the register pair r29:r28) is loaded from ports 61 and 62; these ports map to the registers SPL and SPH (the "L"ow and "H"igh byte of the "S"tack "P"ointer)
- the loaded value is decremented by 1 (sbiw r29:r28, 1)
- the changed value of the stack pointer is written back to the ports; I guess that, to avoid problems, interrupts are disabled first: the interrupt flag toggled by "cli/sei" [which lives in SREG, port 63] is saved to r0 and restored to port 63 afterwards.
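To make the three steps above easier to follow, here is a minimal host-side sketch in C that mirrors the sequence instruction by instruction. It is only a model, not real hardware access: the type avr_io_model, the constant SREG_I and the function reserve_frame are names made up for this illustration, and the assumption is that bit 7 of SREG is the global interrupt flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the three I/O registers involved.
   Nothing here touches real hardware. */
typedef struct {
    uint8_t spl;   /* port 61: low byte of the stack pointer  */
    uint8_t sph;   /* port 62: high byte of the stack pointer */
    uint8_t sreg;  /* port 63: status register, bit 7 = interrupt flag */
} avr_io_model;

#define SREG_I 0x80u

/* Mirrors the prologue step by step and returns the new value of Y. */
static uint16_t reserve_frame(avr_io_model *io, uint8_t bytes)
{
    assert(bytes <= 63);  /* sbiw only accepts constants 0..63 */

    /* in r28, 61 / in r29, 62 */
    uint16_t y = (uint16_t)(((uint16_t)io->sph << 8) | io->spl);

    y = (uint16_t)(y - bytes);          /* sbiw r29:r28, bytes          */
    uint8_t saved = io->sreg;           /* in r0, 63                    */
    io->sreg &= (uint8_t)~SREG_I;       /* cli                          */
    io->sph = (uint8_t)(y >> 8);        /* out 62, r29                  */
    io->sreg = saved;                   /* out 63, r0 (old flag back)   */
    io->spl = (uint8_t)(y & 0xFFu);     /* out 61, r28                  */
    return y;
}
```

Starting from SPH:SPL = 0x0100, reserving one byte gives 0x00FF, so both bytes change. That is the case where an interrupt firing between the two "out" instructions would see a half-updated stack pointer, which is presumably why the sequence masks interrupts.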
This setup of the stack registers seems inefficient. To reserve one byte on the stack, a single "push r0" would do (it decrements the stack pointer by one); after that I could just load SPH/SPL into r29:r28. However, this would probably need changes to the optimizer in llvm's source code. The sequence above only makes sense if more than 3 bytes of stack have to be reserved for local variables when optimizing for speed (-O3); when optimizing for size (-Oz), pushing pays off for up to 6 bytes. However, I guess we would need to touch the llvm source for that, so this is out of scope.
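The break-even points mentioned above can be checked with a bit of arithmetic. The sketch below is my own back-of-the-envelope model, not anything taken from llvm: it assumes a classic AVR core such as the ATtiny45, where in/out/cli take 1 cycle, sbiw and push take 2 cycles, and every instruction involved occupies 2 bytes of flash. The names prologue_cost, cost_sbiw_sequence and cost_push_sequence are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed costs for a classic AVR core (e.g. ATtiny45):
   in/out/cli = 1 cycle, sbiw = 2 cycles, push = 2 cycles,
   and each of these instructions is one 16-bit word (2 bytes). */
typedef struct {
    unsigned flash_bytes;
    unsigned cycles;
} prologue_cost;

/* in, in, sbiw, in, cli, out, out, out: independent of the frame size
   (as long as the frame fits the sbiw constant range). */
static prologue_cost cost_sbiw_sequence(void)
{
    prologue_cost c = { 8u * 2u, 1u + 1u + 2u + 1u + 1u + 1u + 1u + 1u };
    return c;
}

/* One push per reserved byte, then in r28,61 / in r29,62 to load Y. */
static prologue_cost cost_push_sequence(unsigned frame_bytes)
{
    assert(frame_bytes > 0);
    prologue_cost c = { (frame_bytes + 2u) * 2u, frame_bytes * 2u + 2u };
    return c;
}
```

Under these assumptions the sbiw sequence costs 16 bytes and 9 cycles regardless of frame size. Pushing is faster up to 3 bytes (8 cycles) and slower from 4 bytes on (10 cycles); in flash it is smaller up to 5 bytes and ties at 6 bytes (16 bytes each).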
More interesting is this part:
push r28
push r29
As main() is not intended to return, this doesn't make sense. This just wastes RAM and flash memory for silly instructions (remember: we have only 64, 128 or 256 bytes SRAM available in some devices).
I investigated this a bit further: if we let main return (i.e. no endless loop), the stack pointer is restored, we get a "ret" instruction at the end, AND the registers r28 and r29 are restored from the stack via "pop r29, pop r28". But the compiler should know that if the scope of the function "main" is never left, all registers can be clobbered without storing them on the stack first.
This problem may seem a bit "silly", as we are talking about 2 bytes of RAM. But just think about what happens once the program starts using the rest of the registers.
All this really changed my view of current "compilers". I thought there wouldn't be much room left today for optimizing by hand in assembler. But it seems there is...
So, still the question is...
Do you have any idea how to improve this situation (except for filing a bug report / feature request)?
I mean: Are there just some compiler switches I might have overlooked...?
Additional Info
Using __attribute__ ((OS_main)) works for avr-gcc.
The output is as follows:
.file "test.c"
__SREG__ = 0x3f
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__CCP__ = 0x34
__tmp_reg__ = 0
__zero_reg__ = 1
.global __do_copy_data
.global __do_clear_bss
.text
.global main
.type main, @function
main:
push __tmp_reg__
in r28,__SP_L__
in r29,__SP_H__
/* prologue: function */
/* frame size = 1 */
ldi r24,lo8(1)
std Y+1,r24
.L2:
rjmp .L2
.size main, .-main
This is (in my opinion) optimal in size (6 instructions or 12 bytes) and also in speed for this sample program. Is there any equivalent attribute for llvm? (clang version '3.2 (trunk 160228) (based on LLVM 3.2svn)' knows neither OS_task nor OS_main.)
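As a stopgap, the same source file can at least be kept compilable with both toolchains by selecting the attribute at preprocessing time. This is only a sketch under assumptions: ENTRY_ATTR, ENTRY_HAS_OS_MAIN and firmware_entry are made-up names, and it assumes that avr-gcc accepts OS_main together with noreturn. It does not make clang drop the two pushes; it only avoids handing clang an attribute it does not know.

```c
/* Compilers without __has_attribute: treat every attribute as unknown. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Use OS_main on avr-gcc (which may predate __has_attribute) or on any
   compiler that reports the attribute; otherwise fall back to noreturn. */
#if (defined(__AVR__) && defined(__GNUC__) && !defined(__clang__)) \
    || __has_attribute(OS_main)
#define ENTRY_HAS_OS_MAIN 1
#define ENTRY_ATTR __attribute__((OS_main, noreturn))
#else
#define ENTRY_HAS_OS_MAIN 0
#define ENTRY_ATTR __attribute__((noreturn))
#endif

/* Declaration only; on the target this would be main() from test.c. */
ENTRY_ATTR void firmware_entry(void);
```

ENTRY_HAS_OS_MAIN can be checked in the build (for example with #if and #warning) to see which branch was taken for a given compiler.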