8

I wrote code that numerically evaluates Legendre polynomials up to some high n-th order. For example:

....
case 8 
p = (6435*x.^8-12012*x.^6+6930*x.^4-1260*x.^2+35)/128; return
case 9 
...

If the vector x is long this can become slow. I saw that there is a performance difference between, say, x.^4 and x.*x.*x.*x, and thought I could use this to speed up my code. Using timeit I found that for:

x = linspace(0,10,1e6);
f1 = @() power(x,4);
f2 = @() x.^4;
f3 = @() x.^2.^2;
f4 = @() x.*x.*x.*x;

f4 is faster by a factor of 2 than the rest. However, when I go to x.^6 there is very little difference between (x.*x.*x).^2 and x.*x.*x.*x.*x.*x (while all other options are slower).

Is there a way to tell in advance what will be the most efficient way to take a power of a vector? And can you explain why there is such a big difference in performance?


3 Answers

8

This is not exactly an answer to your question, but it may solve your problem:

x2 = x.*x; % or x.^2 or power(x,2), whichever is most efficient
p = ((((6435*x2-12012).*x2+6930).*x2-1260).*x2+35)/128;

This way you compute the power just once, and only with exponent 2. This trick can be applied to all Legendre polynomials (for the odd-degree polynomials, one x2 is replaced by x).
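As a sketch of what this looks like for an odd degree as well, here is a hypothetical snippet (not the asker's code) evaluating P8 and P9 this way; the coefficients are the standard Legendre ones, with P8 matching the question:

```matlab
% Nested (Horner) evaluation of P8 and P9 with a single elementwise square.
x2 = x.*x;  % compute the square once and reuse it

% Even degree: a polynomial in x2 only.
p8 = ((((6435*x2 - 12012).*x2 + 6930).*x2 - 1260).*x2 + 35)/128;

% Odd degree: same scheme, with an overall factor of x.
p9 = x.*((((12155*x2 - 25740).*x2 + 18018).*x2 - 4620).*x2 + 315)/128;
```

A quick sanity check: both polynomials should give 1 at x = 1 (e.g. 12155 - 25740 + 18018 - 4620 + 315 = 128, so p9(1) = 128/128 = 1).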

Luis Mendo
  • 106,541
  • 12
  • 66
  • 138
1

Here are some thoughts:

power(x,4) and x.^4 are equivalent (just read the doc).

x.*x.*x.*x is probably optimized to something like x.^2.^2


x.^2.^2 is probably evaluated as: Take the square of each element (fast), and take the square of that again (fast again).

x.^4 is probably directly evaluated as: Take the fourth power of each element (slow).

It is not so strange that two fast operations take less time than one slow operation. It's just too bad that the optimization is not performed in the x.^4 case; perhaps it doesn't always pay off, or it comes at a cost (input checking, memory?).
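The "two fast squarings" idea generalizes to exponentiation by squaring, which needs only O(log n) elementwise multiplications for an integer exponent n. A minimal sketch of such a helper (hypothetical, not a builtin; whether it actually beats a plain chain of x.*x.*... in MATLAB depends on the case, as the timings in the other answer show):

```matlab
function y = powbysq(x, n)
% Elementwise x.^n for a nonnegative integer scalar n, via repeated squaring.
y = ones(size(x));
while n > 0
    if mod(n, 2) == 1
        y = y .* x;       % fold in the current low bit of the exponent
    end
    x = x .* x;           % square the base for the next bit
    n = floor(n / 2);
end
end
```

For n = 4 this performs exactly the two squarings discussed above; for n = 7 it does two squarings plus two extra multiplies.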


About the timings: there is actually much more than a factor-of-2 difference!

Since you call them through anonymous functions, the function-call overhead is added in each case, which makes the relative differences look smaller:

y=x;tic,power(x,4);toc
y=x;tic,x.^4;toc
y=x;tic,x.^2.^2;toc
y=x;tic,x.*x.*x.*x;toc

will give:

Elapsed time is 0.034826 seconds.
Elapsed time is 0.029186 seconds.
Elapsed time is 0.003891 seconds.
Elapsed time is 0.003840 seconds.

So it is nearly a factor of 10. Note, however, that the difference in seconds is still minor, so for most practical applications I would just go for the simple syntax.

Dennis Jaheruddin
  • 19,745
  • 7
  • 58
  • 100
  • 1
    The optimization which is presumably made on `x.*x.*x.*x` behaves strangely. I have tried `x.*x.* ... .*x` with varying numbers of "x" from 2 to 8, and the time increases more or less linearly. I would have expected bumps; for example the "8" case (=> `x.^2.^2.^2`: three power operations) should take less time than "7" (=> more power operations) – Luis Mendo Sep 25 '13 at 14:34
  • 2
    @LuisMendo I don't know how to verify, but I could imagine that it only does 1 step (no nested optimization). For 7 it would then reduce to something like `x.^2.*x.^2.*x.^2.*x`, which would not be slower than `x.^2.*x.^2.*x.^2.*x.^2` for 8. If doing 8 were faster than doing 7 this way, Mathworks probably would have implemented this kind of optimization in the power function. – Dennis Jaheruddin Sep 25 '13 at 14:38
  • Yes, that might be the explanation: no nesting – Luis Mendo Sep 25 '13 at 14:51
  • @DennisJaheruddin, I think you're right. See my answer (that I was composing when you answered) -- nesting is surprisingly 2x slower for the 16th power. – mbauman Sep 25 '13 at 14:58
1

It seems as though Mathworks has special-cased squares in its power function (unfortunately, it's all built-in closed-source code that we cannot inspect). In my testing on R2013b, .^, power, and realpow appear to use the same algorithm. For squares, I believe they have special-cased it to be x.*x.

1.0x (4.4ms):   @()x.^2
1.0x (4.4ms):   @()power(x,2)
1.0x (4.5ms):   @()x.*x
1.0x (4.5ms):   @()realpow(x,2)
6.1x (27.1ms):  @()exp(2*log(x))

For cubes, the story is different: they are no longer special-cased. Again, .^, power, and realpow are all similar, but much slower this time:

1.0x (4.5ms):   @()x.*x.*x
1.0x (4.6ms):   @()x.*x.^2
5.9x (26.9ms):  @()exp(3*log(x))
13.8x (62.3ms): @()power(x,3)
14.0x (63.2ms): @()x.^3
14.1x (63.7ms): @()realpow(x,3)

Let's jump up to the 16th power to see how these algorithms scale:

1.0x (8.1ms):   @()x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x
2.2x (17.4ms):  @()x.^2.^2.^2.^2
3.5x (27.9ms):  @()exp(16*log(x))
7.9x (63.8ms):  @()power(x,16)
7.9x (63.9ms):  @()realpow(x,16)
8.3x (66.9ms):  @()x.^16

So: .^, power, and realpow all run in constant time with respect to the exponent, unless it has been special-cased (-1 also appears to be special-cased). Using the exp(n*log(x)) trick is also constant time with respect to the exponent, and faster. The one result I don't quite understand is why repeated squaring is slower than repeated multiplication.

As expected, increasing the size of x by a factor of 100 increases the time similarly for all algorithms.

So, moral of the story? When using scalar integer exponents, always do the multiplication yourself. There's a whole lot of smarts in power and friends (exponent can be floating point, vector, etc). The only exceptions are where Mathworks has done the optimization for you. In 2013b, it seems to be x^2 and x^(-1). Hopefully they'll add more as time goes on. But, in general, exponentiation is hard and multiplication is easy. In performance sensitive code, I don't think you can go wrong by always typing x.*x.*x.*x. (Of course, in your case, follow Luis` advice and make use of the intermediate results within each term!)

function powerTest(x)
% Benchmark several ways of computing x.^16 and print them sorted by speed.

f{1} = @() x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x;
f{2} = @() x.^2.^2.^2.^2;
f{3} = @() exp(16.*log(x));
f{4} = @() x.^16;
f{5} = @() power(x,16);
f{6} = @() realpow(x,16);

t = zeros(1,length(f));          % preallocate the timing vector
for i = 1:length(f)
    t(i) = timeit(f{i});
end

[t,idxs] = sort(t);              % sort fastest first
fcns = f(idxs);

for i = 1:length(fcns)
    fprintf('%.1fx (%.1fms):\t%s\n',t(i)/t(1),t(i)*1e3,func2str(fcns{i}));
end
mbauman
  • 28,560
  • 4
  • 76
  • 106