
Focusing on worst-case cycle count, I've coded integer multiplication routines for Atmel's AVR architecture.
In one particular implementation, I'm stuck with 2+1 worst cases, for each of which I seek a faster implementation. These multiply multiplicands with an even number of bytes by known values of an 8-bit part of the multiplier:
* 11001011 (203₁₀)
* 10101011 (171₁₀)
* 10101101 (173₁₀)
GCC (4.8.1) computes these as *29*7, *19*9, and *(43*4+1) - a nice fit for a 3-address machine, which the tinyAVR isn't (quite: most have a register-pair move twice as fast as a 16-bit add). For a two-byte multiplicand & product, this uses 9+2, 10+2, and 11+2 additions (& subtractions) and moves, respectively, for 20, 22, and 24 cycles. Radix-4 Booth would use 11+1 additions (under not exactly comparable conditions) and 23 cycles.
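(One way to read those factorizations in shift-and-add terms - whether GCC picks exactly these decompositions of the factors isn't shown here, so treat this as illustration only:

    ; 203 = 29*7     = (32 - 2 - 1) * (8 - 1)
    ; 171 = 19*9     = (16 + 2 + 1) * (8 + 1)
    ; 173 = 43*4 + 1 = (32 + 8 + 2 + 1)*4 + 1

)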
For reasons beyond this question, I have 16*multiplicand precomputed (in a5:a4, 7 cycles including a move); both the original and the shifted multiplicand might be used later on (but for the MSByte). The product is initialised to the multiplicand for the assembler code snippets below, in which I use a Booth-style recoding notation: . for no operation, + for add, - for subtract. owing is a label "one instruction before done", performing a one-cycle fix-up.
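(One way to get a5:a4 = 16*(a1:a0) in those 7 cycles including the move, in case it helps - a sketch only, not necessarily the routine actually used; it assumes a1:a0 and a5:a4 are movw-capable pairs in the upper half of the register file, since andi needs r16..r31:

    movw    a4, a0      ; 1  a5:a4 = a1:a0
    swap    a4          ; 2  nibble-swap the low byte
    swap    a5          ; 3  nibble-swap the high byte
    andi    a5, 0xF0    ; 4  a5 = a1<<4
    eor     a5, a4      ; 5  fold in the swapped a0
    andi    a4, 0xF0    ; 6  a4 = a0<<4            (low byte of 16*x)
    eor     a5, a4      ; 7  a5 = (a1<<4)|(a0>>4)  (high byte of 16*x)

Now the snippets:)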

locb:; .-..+.++     ..--.-.- ++..++.-   ++.+.-.-    ..--.-.- 29*7
; WC?!? 11001011                                            s   18
    add p0, p0;15   n   16   a4 15      s   16      n   15  s0  17
    adc p1, p1
    sub p0, a4;13   d   13   z  13      s0  15      s4  12  d   15
    sbc p1, a5
    add p0, p0;11   s4  11   d  12      z   13      z   10  s0  13
    adc p1, p1
    add p0, a0; 9   d   9    aZ 10      a4  12      s0  9   z   11
    adc p1, a1
    add p0, p0; 7   s4  7    d  8       a4  10      d   7   d   10
    adc p1, p1
    add p0, a0; 5   s0  5    d  6       d   8       az  5   aZ  8
    adc p1, a1
    rjmp owing; 3   owi 3    s0 4       d   6       owi 3   d   6
              ;                         aZ  4               aZ  4

(The comments are, from left to right, a cycle count ("backwards from reaching done"), and further code sequences using the recodings in the same column of the label line, in a shorthand: n for negate, d for double the (partial) product, a0/a4/s0/s4 for adding or subtracting the multiplicand shifted 0 or 4 bits to the left, z for storing the product in ZL:ZH, and aZ/sZ for adding or subtracting that saved value.)
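Spelled out in AVR instructions (my expansion - it matches the cycle counts above, with p1:p0 the product, a1:a0 the multiplicand and a5:a4 the 16*multiplicand; sbci needs p1 in r16..r31):

    d:  add p0, p0   adc p1, p1            ; double the product, 2 cycles
    a0: add p0, a0   adc p1, a1            ; add the multiplicand, 2 cycles
    a4: add p0, a4   adc p1, a5            ; add 16*multiplicand, 2 cycles
    s0: sub p0, a0   sbc p1, a1            ; subtract the multiplicand, 2 cycles
    s4: sub p0, a4   sbc p1, a5            ; subtract 16*multiplicand, 2 cycles
    z:  movw ZL, p0                        ; save the product in ZH:ZL, 1 cycle
    aZ: add p0, ZL   adc p1, ZH            ; add the saved value, 2 cycles
    sZ: sub p0, ZL   sbc p1, ZH            ; subtract the saved value, 2 cycles
    n:  com p1   neg p0   sbci p1, 0xFF    ; negate the product, 3 cycles

The named macros below map onto the same operations: doubleP = d, addA = a0, add4 = a4, subA = s0, sub4 = s4, pp2Z = z, addZ = aZ, negP = n; set4 presumably loads the product with 16*multiplicand in 1 cycle (movw p0, a4).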
The other worst cases using macros (and the above-mentioned shorthand):

loab:; .-.-.-.- +.++.-.-    +.+.+.++    .-.-+.++    +.+.++.-
; WC    10101011
    negP    ;15 set4    ;16 add4    ;15 d   16      a4  16
    sub4    ;12 doubleP ;15 pp2Z    ;13 s4  14      d   14
    pp2Z    ;10 subA    ;13 doubleP ;12 z   12      a0  12
    doubleP ; 9 pp2Z    ;11 doubleP ;10 d   11      d   10
    doubleP ; 7 doubleP ;10 addZ    ; 8 d   9       a4  8
    addZ    ; 5 doubleP ; 8 doubleP ; 6 aZ  7       d   6
    owing   ; 3 addZ    ; 6 addA    ; 4 a0  5       s0  4
            ;   add4    ; 4
loac:; +.+.++.. ++-.++.. .-.-.-..
load: ; .-.-..--    .-.-.-.+    .--.++.+    0.9 1.8 0.8 (avg)
; WC    10101101
    negP    ;15 -1      negP    ;16 a4  a0  a0  17
    sub4    ;12 -16-1   sub4    ;13 d   s4  a0
    pp2Z    ;10 -16-1   doubleP ;11 a0  Z   s4
    sub4    ; 9 -32-1   doubleP ; 9 d   d   d
    doubleP ; 7 -64-2   sub4    ; 7 a4  aZ  d
    addZ    ; 5 -80-3   addA    ; 5 d   d   a0
    owing   ; 3         owing   ; 3 a0  a0  s4

(I'm not looking for more than one of the results at any single time/as the result of any single invocation - but if you have a way to get two in less than 23 cycles or all three in less than 26, please let me know. To substantiate my claim to know about CSE (common subexpression elimination), here is a sequence (re)using the notations [rl]sh16 and add16 introduced by vlad_tepesch:

movw        C, x    ; 1
lsh16(C)            ; 3 C=2x
movw        A, C    ; 4
swap        Al      ; 5 
swap        Ah      ; 6
cbr         Ah, 15  ; 7
add         Ah, Al  ; 8
cbr         Al, 15  ; 9
sub         Ah, Al  ;10 A=32x
movw        X, A    ;11
add16(X, C)         ;13 X=34x
movw        B, X    ;14
lsh16(X)            ;16
lsh16(X)            ;18 X=136X
add16(B, X)         ;20 B=170X
add16(B, x)         ;22 B=171X
add16(A, B)         ;24 A=203X
add16(C, B)         ;26 C=173X

- notice that the 22 cycles to the first result are just the same old 20 cycles, plus two register-pair moves. The sequence of actions besides those is that of the third column/alternative following the label loab above.)
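lsh16 isn't spelled out in that answer; I take it to be the 2-cycle left-shift counterpart of rsh16, which is consistent with the cycle counts above:

    lsh16 (X):
      LSL Xl
      ROL Xh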
While 20 cycles (15 - 2 (rjmp) + 7 (*16)) doesn't look that bad, these are the worst cases. On an AVR CPU without a mul instruction, how can one multiply real fast by each of 203, 171, and 173?
(Moving one case to just before done or owing and making the other two faster would shorten the critical path/improve the worst-case cycle count.)

greybeard
  • Shouldn't the x16 multiplicand be computable in 5 cycles? a5=x; swap a5; a4=a5; andi a5, 0xf; andi a4, 0xf0; – vlad_tepesch Jul 06 '15 at 21:31
  • The multiplicand has an even number of bytes, 2 (16 bits) in the examples here. – greybeard Jul 06 '15 at 21:39
  • It's pretty hard to optimize for each of them. A better way, as @vlad_tepesch has implied, is to take advantage of 203-171=32 and 173-171=2 to do higher-level optimization. – user3528438 Jul 06 '15 at 22:12
  • Please explain `better` (and `it`). I don't need two of these products at the same time, let alone all three. I need one only, as per external request, and I'd prefer _fast_. – greybeard Jul 06 '15 at 22:33
  • @greybeard wait - your variable multiplicand is 16 bit? Then your result has to be 24 bit = 3 bytes! Please refine your question and make clear what constraints your inputs have. Also note which particular ATtiny series you use. – vlad_tepesch Jul 07 '15 at 10:10
  • Well, there is what GCC calls "widening mul" (product length = sum of factor lengths), what I'd call "truncating mul" (product length = (max.) factor length), and others are useful for multiply-accumulate sequences in, e.g., filter implementations. The question mentions `8-bit part of the multiplier` and states `two byte multiplicand & product` - any tip how I could make confusion less likely? – greybeard Jul 07 '15 at 14:56

2 Answers


I am not very familiar with AVR assembly, but I know the AVRs quite well, so I'll give it a try.

If you need these products in the same place, you could use common intermediate results and build them by adding power-of-two multiples of x.

(a) 203: +128 +64 +8 +2 +1 =  +128 +32 +8 +2 +1   +32
(b) 171:                      +128 +32 +8 +2 +1
(c) 173: +128 +32 +8 +4 +1 =  +128 +32 +8 +2 +1   +2

The key is that a 16-bit right shift and a 16-bit addition have to be efficient. I do not know if I have overlooked something, but:

rsh16 (X): 
  LSR Xh
  ROR Xl

and

add16 (Y,X)
  ADD  Yl, Xl
  ADC  Yh, Xh

Both take 2 cycles.

One register pair (Xh, Xl) holds the current x*2^n value, and 3 other pairs (Ah, Al, Bh, Bl, Ch, Cl) hold the results.

 1. Xh <- x;  Xl <- 0    (X = 256*x)
 2. rsh16(X)             (X = 128*x) 
 3. B = X                (B = 128*x)
 4. rsh16(X); rsh16(X)   (X =  32*x)
 5. add16(B, X)          (B = 128*x + 32*x)
 6. A = X                (A =  32*x)
 7. rsh16(X); rsh16(X)   (X =   8*x)
 8. add16(B, X)          (B = 128*x + 32*x+ 8*x)
 9. rsh16(X); rsh16(X)   (X =   2*x)
10. add16(B, X)          (B = 128*x + 32*x + 8*x + 2*x)
11. C = X                (C =   2*x)
12. CLR  Xh              (clear Xh so we only add the carry below)
    add  Bl, x
    adc  Bh, Xh          (B = 128*x + 32*x + 8*x + 2*x + x = 171*x)
13. add16(A, B)          (A = 32*x + B = 203*x)
14. add16(C, B)          (C =  2*x + B = 173*x)

If I am correct, this sums up to 32 cycles for all three multiplications and requires 9 registers (1 in, 6 out, 2 temporary).
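Spelled out as plain AVR instructions (register names are symbolic; I assume each of X, A, B, C is a movw-capable adjacent register pair and x a single 8-bit register, as above), the sequence could look like:

    mov   Xh, x        ;  1  X = 256*x
    clr   Xl           ;  2
    lsr   Xh
    ror   Xl           ;  4  X = 128*x
    movw  Bl, Xl       ;  5  B = 128*x
    lsr   Xh
    ror   Xl
    lsr   Xh
    ror   Xl           ;  9  X =  32*x
    add   Bl, Xl
    adc   Bh, Xh       ; 11  B = 160*x
    movw  Al, Xl       ; 12  A =  32*x
    lsr   Xh
    ror   Xl
    lsr   Xh
    ror   Xl           ; 16  X =   8*x
    add   Bl, Xl
    adc   Bh, Xh       ; 18  B = 168*x
    lsr   Xh
    ror   Xl
    lsr   Xh
    ror   Xl           ; 22  X =   2*x
    add   Bl, Xl
    adc   Bh, Xh       ; 24  B = 170*x
    movw  Cl, Xl       ; 25  C =   2*x
    clr   Xh           ; 26  only the carry gets added below
    add   Bl, x
    adc   Bh, Xh       ; 28  B = 171*x
    add   Al, Bl
    adc   Ah, Bh       ; 30  A = 203*x
    add   Cl, Bl
    adc   Ch, Bh       ; 32  C = 173*x

One cycle per instruction, 32 instructions - matching the count above.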

vlad_tepesch
  • While your approach seems (almost) valid, it solves a different problem: I'm looking for ways to multiply the value in a (high) register pair faster than 22 cycles - by 171, and/or 173, and/or 203 - _not_ a way to multiply by 171*173*203, or to get all three aforementioned products in less than 66 cycles, combined. (The "almost" being owed to my conviction that rsh16 will not reconstruct the bits shifted out to the left before.) (see [tinyAVR: best known multiplication routines](http://stackoverflow.com/a/31074276/3789665) for an application) – greybeard Jul 06 '15 at 19:17
  • @greybeard I think you should reread my answer. The outcome of this algorithm is A=X*203, B=X*171, C=X*173 – vlad_tepesch Jul 06 '15 at 21:00
  • I noticed. That does not give me x*171 in less than 22 cycles. It does not give me x*173 in less than 22 cycles. It does not give me x*203 in less than 22 cycles. It (sort of, see comment about right shift) _does_ give me all three in less than 66 cycles - which happens not to be what I'm after. (Ah - and no, there doesn't seem to be a "just add carry" as opposed to `sbci r#, 0` being "just subtract borrow") – greybeard Jul 06 '15 at 21:03
  • @greybeard correct, this gives you your 3 results in 32 cycles. Twice as fast as your initial approach, but only if you require all three results, and still 1.5 times as fast if you only need 2 results. Did you not think that if there were a way to calculate a single result faster than the above, the compiler would use it? – vlad_tepesch Jul 06 '15 at 21:08
  • Further optimization, I think, can only be done by trading memory for speed: put a 512-byte lookup table for B into memory. For C, left-shift x and add it; for A, left-shift it again 4 times and add it, or make another lookup table. – vlad_tepesch Jul 06 '15 at 21:15
  • *I don't need all three results at once*. I need three procedures. One to get x*171 fast. Another one for x*203. And a third one for x*173. How does a 512 byte table give me products for 16 bit multiplicands? Three "one byte fetches", two with "the same" index and one with an independent one, and a final add, right? Table access on AVR is abysmally slow, even more so for tables of constants, see (and/or improve on) [table lookup of squares for partial products](http://stackoverflow.com/a/29935477/3789665) and [table lookup of partial products](http://stackoverflow.com/a/30066373/3789665). – greybeard Jul 06 '15 at 21:37
  • @greybeard you mentioned the shift in your initial comment. Why should this not work? LSR shifts bit 0 into carry and ROR puts the carry into bit 7 and shifts all other bits right, so the LSB of Xh is transferred into the MSB of Xl. – vlad_tepesch Jul 07 '15 at 13:26
  • Zeroes are shifted into Xh, and x was a 16 bit/2 byte/register value to begin with. Something like the following should solve that problem that is not mine:` movw C, x ; 1 lsh16(C) ; 3 C=2x movw A, C ; 4 swap Al ; 5 swap Ah ; 6 cbr Ah, 15 ; 7 add Ah, Al ; 8 cbr Al, 15 ; 9 sub Ah, Al ;10 A=32x movw X, A ;11 add16(X, C) ;13 X=34x movw B, X ;14 lsh16(X) ;16 lsh16(X) ;18 X=136X add16(B, X) ;20 B=170X add16(B, x) ;22 B=171X add16(A, B) ;24 A=203X add16(C, B) ;26 C=173X` – greybeard Jul 07 '15 at 14:48
  • No, in my understanding x was an 8-bit number, since your multiplication results were 16 bit. If your input is 16 bit, then your result should be 24 bit - which it is not in your question. That's why I started with copying x to Xh, which is equivalent to multiplying by 256. – vlad_tepesch Jul 07 '15 at 16:03

What did (sort of) work for me:

Triplicate the owing & done code: none of the worst cases needs a jump.

(Making all three faster than tens of "runners up" - meh.)

greybeard