7

As a hobby project I'm taking a crack at finding really large prime numbers. The primality tests for this contain modular exponentiation calculations, i.e. a^e mod n. Let's call this the modpow operation to keep the explanation simple. I want to speed up this particular calculation.

Currently I am using GMP's mpz_powm function, but it is kind of slow. The reason I think it's too slow is that a single call to GMP's modpow is slower than a full-blown primality test in the software PFGW for the same large number. (So to be clear, I am comparing just GMP's modpow part, not my whole custom primality testing routine.) PFGW is considered the fastest in its field, and for my use case it runs a Brillhart-Lehmer-Selfridge primality test - which also uses the modpow procedure - so it's not because of mathematical cleverness that PFGW is faster in that respect (please correct me if I'm wrong here). It looks like the bottleneck in GMP is the modpow operation. An example runtime for numbers which have a little over 20,000 digits: GMP's modpow operation takes about 45 seconds, while PFGW finishes the whole primality test (involving a modpow) in 9 seconds flat. The difference gets even more impressive with even bigger numbers. GMP uses FFT multiplication and Montgomery reduction for this test comparison, see comments on this post below.
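For reference, the GMP call I am timing boils down to the following minimal sketch (the operands here are simple placeholders of roughly the size mentioned above, not my actual numbers):

```cpp
#include <gmp.h>
#include <cstdio>

int main() {
    mpz_t base, exp, mod, result;
    mpz_inits(base, exp, mod, result, NULL);

    // Placeholder operands; in the real test these have tens of thousands of digits.
    mpz_set_ui(base, 3);
    mpz_ui_pow_ui(mod, 10, 20000);       // modulus ~ 10^20000 ...
    mpz_add_ui(mod, mod, 7);             // ... made odd for the example
    mpz_sub_ui(exp, mod, 1);             // exponent n-1, as in a Fermat/BLS-style test

    mpz_powm(result, base, exp, mod);    // <-- the modpow operation in question

    printf("result has %zu decimal digits\n", mpz_sizeinbase(result, 10));
    mpz_clears(base, exp, mod, result, NULL);
    return 0;
}
```

Compile with `g++ modpow.cpp -lgmp`; essentially all of the reported runtime is spent inside the `mpz_powm` call.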

I did some research. So far I understand that the modpow algorithm uses exponentiation by squaring, integer multiplication and modulo reduction - these all sound very familiar to me. Several helper methods, such as Toom-Cook and FFT-based (Schönhage-Strassen) multiplication, could improve the running time of the integer multiplication part.

To improve the running time of the exponentiation-by-squaring part, one may use a signed-digit representation to reduce the number of multiplications (i.e. bits are represented as 0, 1 or -1, and the bit string is recoded so that it contains many more zeros than its original base-2 representation - this reduces the number of multiplications needed during exponentiation by squaring).
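As an illustration of the signed-digit idea, here is a minimal sketch (not code from any of the libraries discussed) that computes the non-adjacent form (NAF) of an exponent; on average only about one third of the NAF digits are nonzero, and a digit of -1 corresponds to multiplying by the precomputed modular inverse of the base during square-and-multiply.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Non-adjacent form (NAF): a signed-digit representation with digits in
// {-1, 0, +1} in which no two consecutive digits are nonzero. The number of
// squarings stays the same, but the expected number of multiplications drops
// from ~1/2 to ~1/3 of the exponent's bit length.
std::vector<int> naf(uint64_t e) {
    std::vector<int> digits;                      // least significant digit first
    while (e > 0) {
        if (e & 1) {
            int d = 2 - static_cast<int>(e & 3);  // +1 if e = 1 (mod 4), -1 if e = 3 (mod 4)
            digits.push_back(d);
            if (d == 1) e -= 1; else e += 1;      // e becomes divisible by 4 (or zero)
        } else {
            digits.push_back(0);
        }
        e >>= 1;
    }
    return digits;
}

int main() {
    for (int d : naf(55))                         // 55 = 0b110111 = 64 - 8 - 1
        std::cout << d << ' ';                    // prints LSB first: -1 0 0 -1 0 0 1
    std::cout << '\n';
}
```

Note that real libraries typically use windowed (k-ary) exponentiation rather than NAF here; the goal in both cases is the same, fewer multiplications per exponent bit.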

For optimizing the modulo part of the operation, I know of methods such as Montgomery reduction.

So here is the 150,000 dollar question: is there a software library available that does a modpow operation efficiently given a very large base, exponent and modulus? (I'm aiming for several million digits.) If you suggest an option, please try to explain the inner workings of the algorithm for the case of millions of digits for the base, modulus and exponent, as some libraries use different algorithms depending on the number of digits. Basically I am looking for a library which supports the techniques mentioned above (or possibly more clever techniques) and which performs well while running the algorithm (well, better than GMP at least). So far I've searched for, found and tried GMP and PFGW, but didn't find these satisfying (PFGW is fast, but I'm only interested in the modpow operation and there is no direct programming interface to it). I'm hoping that maybe an expert in the field can suggest a library with these capabilities, as there seem to be very few that are able to handle these requirements.

Edit: made the question more concise, as it was marked too broad.

  • The general `mpn_mul`, `mpn_sqr` are used by `mpn_powm`, which will use FFT if the sizes of the multiplicands are over the Toom-Cook thresholds, unless GMP has been built with `--disable-fft`. `mpn_powm` uses Montgomery reduction; even an even modulus is handled, as `mpz_powm` decomposes the modulus into an odd factor and an even (2^i) factor and applies the Chinese remainder theorem to the results (a sketch of this split follows after these comments). The Montgomery reduction routine `mpn_redc_n` will also use FFT if the operands are large. – Brett Hale Jul 11 '14 at 12:19
  • Thanks for your input Brett. Looks like [`mpn_powm`](https://gmplib.org/manual/Low_002dlevel-Functions.html#Low_002dlevel-Functions) isn't documented ([as mentioned here](https://gmplib.org/list-archives/gmp-discuss/2012-October/005160.html)). That page says `mpn_powm` is faster than `mpz_powm`. Do you happen to know why that is and what the differences between the two are? The [GMP function index](https://gmplib.org/manual/Function-Index.html) also doesn't mention `mpn_redc_n` which sounds like something very useful! Where can I find the documentation of these functions you mentioned? Thanks! – webdevelopersdiary Jul 11 '14 at 12:30
  • I looked it up and am putting it here for reference: the thresholds are mentioned [here](https://gmplib.org/manual/Multiplication-Algorithms.html). The maximum [threshold for Toom-Cook](https://gmplib.org/devel/thres/MUL_TOOM8H_THRESHOLD.html) seems to be a little less than 600 limbs in the 'worst' case. A limb can hold 32 or 64 bits, so that means it'll use FFT for integers larger than ~600*64 = 38,400 bits on my 64-bit computer. I hadn't reached that number of bits yet. I'll do another test run to see what happens above that limit. – webdevelopersdiary Jul 11 '14 at 12:48
  • I couldn't edit my last comment anymore, so putting this in a new one. The ~20,000 digit test I did (mentioned in OP) was with a ~70,000 bit number, so I'm in the FFT spectrum already. – webdevelopersdiary Jul 11 '14 at 12:56
  • The `mpz_` functions are the recommended interface. The `mpn_` functions do the real work, implementing the lower-level routines. That said, the `mpz_` functions also set up working memory, massage the data, select the best routines for the operands, etc. For such a large calculation, you would save nothing by bypassing the MPZ interface, and would have to essentially replicate the setup code. – Brett Hale Jul 11 '14 at 13:06
  • I see. Thanks for explaining. Now I am really wondering why GMP and PFGW have such a huge difference in running time! – webdevelopersdiary Jul 11 '14 at 13:18
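The odd/even split Brett Hale describes can be sketched at the mpz level roughly as follows (an illustration of the idea only, not GMP's actual internal mpn code; it assumes the modulus n is even and that r is distinct from a, e and n):

```cpp
#include <gmp.h>

// Sketch of computing r = a^e mod n for an even modulus n = q * 2^k (q odd):
// solve the two easy sub-problems separately and recombine with the CRT.
void powm_even(mpz_t r, const mpz_t a, const mpz_t e, const mpz_t n) {
    mp_bitcnt_t k = mpz_scan1(n, 0);      // number of trailing zero bits (k >= 1 here)
    mpz_t q, two_k, r1, r2, qinv;
    mpz_inits(q, two_k, r1, r2, qinv, NULL);

    mpz_tdiv_q_2exp(q, n, k);             // odd factor q = n / 2^k
    mpz_setbit(two_k, k);                 // two_k = 2^k

    mpz_powm(r1, a, e, q);                // a^e mod q   (odd modulus, Montgomery-friendly)
    mpz_powm(r2, a, e, two_k);            // a^e mod 2^k (reduction = keep the low k bits)

    // CRT recombination: r = r1 + q * ((r2 - r1) * q^{-1} mod 2^k)
    mpz_invert(qinv, q, two_k);           // q is odd, so it is invertible mod 2^k
    mpz_sub(r, r2, r1);
    mpz_mul(r, r, qinv);
    mpz_fdiv_r_2exp(r, r, k);             // reduce mod 2^k (non-negative remainder)
    mpz_mul(r, r, q);
    mpz_add(r, r, r1);

    mpz_clears(q, two_k, r1, r2, qinv, NULL);
}
```

The point of the decomposition is that the odd factor q can be handled with Montgomery reduction, while reduction modulo 2^k is just masking off the low k bits.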

2 Answers

9

First off, re. the Answer 1 writer's comment "I do not use GMP but I suspect when they wrote they use FFT they really mean the NTT" -- no, when GMP says "FFT" it means a floating-point FFT. IIRC they also have some NTT-based routines, but for bignum mul those are uncompetitive with FFT.

The reason a well-tuned FFT-mul beats any NTT is that the slight loss of per-word precision due to roundoff error accumulation is more than made up for by the vastly superior floating-point capabilities of modern CPU offerings, especially when one considers high-performance implementations which make use of the vector-math capabilities of CPUs such as the x86_64 family, the current iterations of which - Intel Haswell, Broadwell and Skylake - have massive vector floating-point capability. (I don't cite AMD in this regard because their AVX offerings have lagged far behind Intel's; their high-water mark was circa 2002 and since then Intel has been beating the pants off them in progressively-worse fashion each year.) The reason GMP disappoints in this area is that GMP's FFT is, relatively speaking, crap. I have great respect for the GMP coders overall, but FFT timings are FFT timings, you don't get points for effort or e.g. having a really good bignum add. Here is a paper detailing a raft of GMP FFT-mul improvements:

Pierrick Gaudry, Alex Kruppa, Paul Zimmermann: "A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm" [http://www.loria.fr/~gaudry/publis/issac07.pdf]

This is from 2007, but AFAIK the performance gap noted in the snippet below has not been narrowed; if anything it has widened. The paper is excellent for detailing various mathematical and algorithmic improvements which can be deployed, but let's cut to the money quote:

"A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^2wn − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron."

The next few paragraphs are a raft of face-saving spin. (And again, I am personally acquainted with 2 of the 3 authors, and they are all top guys in the field of computational number theory.)

Note that the aforementioned George Woltman, whose Prime95 code has discovered all of the world-record primes since shortly after its debut 20 years ago, has made his core bignum routines available in a general API-ized form called the GWNUM library. You mentioned how much faster PFGW is than GMP for FFT-mul - that's because PFGW uses GWNUM for the core 'heavy lifting' arithmetic; that's where the 'GW' in PFGW comes from.

My own FFT implementation, which has generic-C build support but like George's uses reams of x86 vector-math assembler for high performance on that CPU family, is roughly 60-70% slower than George's on current Intel processor families. I believe that makes it the world's 2nd-fastest bignum-mul code on x86. By way of example, my code is currently running a primality test on a number with roughly 2^29 bits using a 30-Mdouble-length FFT (30*2^20 doubles); thus a little more than 17 bits per input word. Using all four of my 3.3 GHz Haswell 4670 quad's cores it takes ~90 ms per modmul.

BTW, many (if not most) of the world's top bignum-math coders hang out at mersenneforum.org. I encourage you to check it out and ask your questions of the broader (at least in this particular area) expert audience there. I appear under the same handle there as here; George Woltman appears as "Prime95", and PFGW's Mark Rodenkirch goes as "rogue".

ewmayer
  • 106
  • 1
  • 3
  • Great information, thank you! I will drop by at Mersenne Forum some time, appreciate it! – webdevelopersdiary Sep 25 '15 at 11:43
  • +1 As the developer of y-cruncher, I've been following GMP for years. And while I also respect what they do, they do seem overly stubborn in their ways. Anything that contains the words "floating-point" or "parallelism" is immediately thrown out the door. Even provably-correct FFTs will beat the crap out of GMP. So they are denying themselves the superior algorithms and the ability to utilize modern hardware. – Mysticial Mar 20 '16 at 19:40
  • The only valid reason I can think of for sticking with the SSA algorithm is that it uses very little memory and doesn't require caching twiddle factors. From a library perspective, that does simplify things, but personally I don't think it's worth throwing out 10x in performance for that. (And the memory problem can be trivially fixed by throwing a layer of Karatsuba/Toom-Cook on top of the FFT.) – Mysticial Mar 20 '16 at 19:45
3

I do not use GMP at all, so read this with that in mind.

  1. I would rather use NTT instead of FFT for multiplication

    it removes the rounding errors, and compared to my FFT implementations optimized to the same degree it is faster

    as I mentioned, I do not use GMP, but I suspect when they wrote they use FFT they really mean the NTT (finite field Fourier transform); a minimal NTT multiplication sketch is included after this list

  2. The speed difference between your test and the GMP primality test can be caused by the modpow call overhead.

    if there are too many calls to it then it causes heap/stack thrashing, which slows things down considerably, especially for bignums. Try to avoid heap thrashing, so eliminate as much data passing through operands and return values as possible for frequently called functions (see the buffer-reuse sketch after this list). It also sometimes helps to eliminate a bottleneck call by copying the function's source code directly into your code instead of calling it (or using a macro instead), using local variables only.

    I think GMP has published their source code, so finding their implementation of modpow should not be too hard. You just have to use it correctly.

  3. just to be clear

    you are using big numbers of 20,000+ decimal digits, which means ~8.4 KB per number. Any return value or non-pointer operand means copying that amount of data to/from the heap/stack. This not only takes time but also usually invalidates the CPU's cache, which also kills performance.

    Now multiply this by the number of iterations of the algorithm and you get the idea. I had similar problems while tweaking many of my bignum functions, and the speedup was often more than 10,000% (100 times) even with no change to the algorithm used, just from limiting/eliminating heap/stack thrashing.

    So I do not think you need a better modpow implementation, just better usage of it. Of course I may be wrong, but without the code you are using it is hard to deduce more.
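To make point 1 concrete, here is a minimal sketch of NTT-based multiplication. It is not Spektre's C++ class: it uses the widely documented prime 998244353 = 119*2^23 + 1 with primitive root 3 instead of his 0xC0000001 modulus, and base-100 digits so that the convolution coefficients stay below the modulus (which limits operands to roughly 32,768 base-100 digits, i.e. ~65,000 decimal digits - the same kind of overflow limit discussed in the comments below).

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Minimal NTT (number-theoretic transform) multiplication sketch.
// Prime p = 998244353 = 119*2^23 + 1, primitive root 3, so transforms of any
// power-of-two length up to 2^23 exist. Digits are base 100, so a convolution
// coefficient is at most 99^2 * n; with n <= 2^16 this stays below p and the
// result is exact.
using u32 = uint32_t;
using u64 = uint64_t;
const u32 MOD = 998244353, ROOT = 3, BASE = 100;

u32 pow_mod(u32 a, u32 e, u32 m) {
    u64 r = 1, b = a % m;
    for (; e; e >>= 1, b = b * b % m)
        if (e & 1) r = r * b % m;
    return (u32)r;
}

// In-place iterative radix-2 NTT; invert = true gives the inverse transform.
void ntt(std::vector<u32>& a, bool invert) {
    size_t n = a.size();                          // power of two
    for (size_t i = 1, j = 0; i < n; ++i) {       // bit-reversal permutation
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (size_t len = 2; len <= n; len <<= 1) {
        u32 w = pow_mod(ROOT, (MOD - 1) / (u32)len, MOD);   // len-th root of unity
        if (invert) w = pow_mod(w, MOD - 2, MOD);           // its inverse
        for (size_t i = 0; i < n; i += len) {
            u64 wn = 1;
            for (size_t k = 0; k < len / 2; ++k) {
                u32 u = a[i + k], v = (u32)(wn * a[i + k + len / 2] % MOD);
                a[i + k] = (u + v) % MOD;
                a[i + k + len / 2] = (u + MOD - v) % MOD;
                wn = wn * w % MOD;
            }
        }
    }
    if (invert) {                                 // scale by n^{-1}
        u32 inv_n = pow_mod((u32)n, MOD - 2, MOD);
        for (auto& x : a) x = (u32)((u64)x * inv_n % MOD);
    }
}

// Multiply two numbers given as little-endian base-100 digit vectors.
std::vector<u32> mul(std::vector<u32> a, std::vector<u32> b) {
    size_t n = 1;
    while (n < a.size() + b.size()) n <<= 1;
    a.resize(n); b.resize(n);
    ntt(a, false); ntt(b, false);
    for (size_t i = 0; i < n; ++i) a[i] = (u32)((u64)a[i] * b[i] % MOD);
    ntt(a, true);
    u64 carry = 0;                                // restore base-100 digits
    for (size_t i = 0; i < n; ++i) {
        u64 cur = a[i] + carry;
        a[i] = (u32)(cur % BASE);
        carry = cur / BASE;
    }
    return a;                                     // may contain leading zero digits
}

int main() {
    std::vector<u32> r = mul({21, 43}, {87, 65}); // 4321 * 6587, little-endian base 100
    for (size_t i = r.size(); i-- > 0;) printf("%02u", (unsigned)r[i]);
    printf("\n");                                 // prints 28462427
}
```

For operands in the millions of digits one would need either a larger (e.g. 64-bit) modulus or several NTT primes combined via the CRT, exactly the kind of limit discussed in the comments below.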
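And to make points 2 and 3 concrete, a sketch of the "avoid heap/stack thrashing" advice for a GMP-based loop: scratch bignums are allocated once and reused, everything is passed by reference (an mpz_t is an array type, so no value copies are made), and results are written in place. The Scratch struct and the loop body are placeholders, not part of any real API.

```cpp
#include <gmp.h>

// Reusable scratch space: initialized once, reused for every call.
// GMP keeps the limbs it has already allocated for t1/t2, so the loop below
// adds no per-iteration init/clear overhead (GMP may still allocate internally).
struct Scratch {
    mpz_t t1, t2;
    Scratch()  { mpz_inits(t1, t2, NULL); }
    ~Scratch() { mpz_clears(t1, t2, NULL); }
};

// Placeholder test loop: repeated modpow calls against the same modulus n.
void test_loop(const mpz_t n, Scratch& s) {
    mpz_sub_ui(s.t2, n, 1);                  // exponent n-1 (placeholder choice)
    for (unsigned base = 2; base < 12; ++base) {
        mpz_set_ui(s.t1, base);
        mpz_powm(s.t1, s.t1, s.t2, n);       // result overwrites s.t1 in place
        // ... inspect s.t1 here; no mpz_t objects are created or destroyed ...
    }
}
```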

Spektre
  • 41,942
  • 8
  • 91
  • 312
  • Thanks for your suggestion Spektre. Please grant me some time to look into NTT and run some benchmarks to see if this solves my problem. If it does, I'll be accepting this answer :-) Trying to look up some info on NTT. This is the Number Theoretic Transform, right? I'm reading this is a special case of the Discrete Fourier Transform (DFT). Does that mean that all NTT operations have the same time complexity as the DFT? (e.g. multiplication has time complexity O(n log n) if I'm not mistaken) – webdevelopersdiary Jul 23 '14 at 03:50
  • @webdevelopersdiary yes, NTT is the number-theoretic transform ... the same as FFT but on integers modulo a prime instead of the complex domain, which means 2x less data and 2x fewer operations, but you need an n-th root of unity. In that link of mine is also my implementation class in C++ (optimized more than I could dream of in the beginning). I chose the biggest 32-bit n-th root of unity possible, which eliminates all modulo divisions and allows the biggest data sets ... Also I strongly suggest looking at heap/stack thrashing; there you can gain even more speed if your code is wasting it... – Spektre Jul 23 '14 at 06:41
  • That sounds good, thanks! How large is the biggest number you ran through your program yourself (in terms of number of bits or decimal digits)? Also, what do you think is the largest number size it could handle? Or perhaps it's better to ask: how long was the execution time for your personal biggest number? Do you also have data available on what the maximum number size is if I want the program to finish within 1, 4, 12, 24 or 7*24 hours? (Not looking for exact measurements, an estimate is okay.) – webdevelopersdiary Jul 23 '14 at 07:17
  • @webdevelopersdiary NTT input/output vector size is limited because of overflow issues !!! For 32-bit modular arithmetic N is limited to (2^32)/(max(input[])^2), so the bigint must be divided into smaller chunks (I use BYTES, so the max size of bigint processed is (2^32)/((2^8)^2) = 2^16 Bytes = 2^14 DWORDs = 16384 DWORDs) -> max result size is 523136 bits, but you should lower it a bit because the prime used is not 2^32, so I think up to 400000 bits should be safe. I do not use numbers so huge that they would take hours to multiply; see benchmarks for smaller ones inside those linked questions – Spektre Jul 23 '14 at 07:48
  • @webdevelopersdiary if that is not enough you still can port it to 64bit variables (I use old 32bit compiler so that is not an option for me) – Spektre Jul 23 '14 at 07:51
  • I want to try to get into the [top 5000 largest primes](http://primes.utm.edu/primes/), currently #5000 has around ~350,000 decimal digits, which is about 1,162,250 bits long. The largest known prime currently has 57,885,160 bits. If I understand correctly, your program modified to run on a 64 bit system can support 2^48 bytes = 2^56 bits (probably a little less because of choosing a suitable prime), which is more than sufficient to do calculations in the range of world's largest primes. – webdevelopersdiary Jul 23 '14 at 08:59
  • Are you using Mersenne prime #8, which is 2^31-1? A question just occurred to me: if we need a prime, wouldn't this mean that we can only do calculations with numbers up to the range of the world's largest prime number? (or otherwise overflow issues occur?) In other words, it couldn't do calculations on integers larger than the current biggest known (Mersenne?) prime? – webdevelopersdiary Jul 23 '14 at 09:00
  • @webdevelopersdiary 1. the prime p must have an n-th root of unity, it is not enough that it is just prime !!! if you use chunks of (bit count of p)/4 bits for the input data then no overflow occurs !!! the input vector size should be less than p, so you just cut your big number into 8-bit pieces and handle it as a vector. you want to use bigger numbers, so use 64-bit integers or handle the bignum as an array of bignums instead of an array of integers ... (bignum on bignums) – Spektre Jul 23 '14 at 10:02
  • @webdevelopersdiary here is some info about roots of unity and NTT http://stackoverflow.com/a/18547575/2521214 also did some digging in my code for limits (I am very, very far below them, so handle with care): max n for `p=0xC0000001` is `n=0x10000000`, and `(max(src[])^2)*n < p` must hold – Spektre Jul 23 '14 at 10:29
  • I see. Thanks again. I am going to play around with it for a little! – webdevelopersdiary Jul 23 '14 at 10:45
  • @webdevelopersdiary if you use low-bit-count chunks then the maximum whole-number size gets bigger: for 2-bit chunks it is 402653184 bits and for 4-bit chunks it is 50331648 bits. But that also increases the recursion, which slows things down, so I recommend going to 64-bit-wide variables instead of my DWORDs and also increasing p to 64 bits wide (but for that you have to find a proper p; I think there are some sites with them, because computing them the hard way is really slow ... – Spektre Jul 23 '14 at 10:58
  • @webdevelopersdiary Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/57825/discussion-between-spektre-and-webdevelopersdiary). – Spektre Jul 23 '14 at 11:05
  • 1
    I'd be careful about claiming that NTTs are faster than FFTs. Because they really aren't. FFTs are around 3 - 10x faster than even the best of the NTTs - at least until you reach sizes where memory becomes an issue. – Mysticial Mar 20 '16 at 19:49
  • @Mysticial I have edited, but what kind of FFT optimization do you have in mind (hardcoded)? Because my measurements show my NTT is faster than my FFT optimized to the same point. So either the FFT runs on dedicated HW or I am missing some FFT optimizations ... Also the FFT precision errors are too big for bignums, needing more precise variables than the NTT (I tested on 12000+ bits and 32-bit floats are not enough), so NTT wins on every aspect I tested. What am I missing? – Spektre Mar 21 '16 at 08:14
  • No dedicated hardware, just standard x86 processors. If you find that your FFT is a lot slower or has precision errors, then you simply haven't implemented it correctly. Please see the other answer. ewmayer seems to be from the GIMPS project and they know what they are doing. And no, it's not easy to implement an efficient FFT. My own FFT is *only* 3x faster than my best NTT. The only reason I care about NTTs is that they use less memory, so they're better suited for stuff into the billions or trillions of digits long. – Mysticial Mar 21 '16 at 13:11
  • Oh, and if you're using 32-bit floats, then you're completely off on the wrong foot. You need to use double-precision. Heck, before the days of SSE2, it was standard to use 80-bit extended precision. – Mysticial Mar 21 '16 at 13:18