
When multiplying very large numbers, you use FFT-based multiplication (see the Schönhage–Strassen algorithm). For performance reasons I'm caching the twiddle factors. The problem is that for huge numbers (gigabyte-sized) I need FFT tables of size 2^30 and larger, which occupy too much RAM (16 GB and above). So it seems I should use another algorithm.
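For context, here is a compact, in-memory sketch of what FFT-based multiplication means (illustrative only; Schönhage–Strassen replaces the floating-point FFT below with a number-theoretic transform, and real implementations pack bits rather than decimal digits). Each number is split into small base-10^4 digits, the digit sequences are convolved via FFT, and carries are propagated at the end. The base-10^4 digit size is an assumption that keeps the convolution sums well inside a double's 53-bit mantissa for moderately sized inputs; it does not scale to terabyte operands.

```cpp
#include <cmath>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// In-place iterative radix-2 Cooley-Tukey FFT (bit-reversal, then butterflies).
static void fft(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {      // bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {  // log2(n) butterfly passes
        const double ang = 2 * pi / len * (invert ? 1 : -1);
        const cd wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            cd w(1.0, 0.0);
            for (std::size_t j = 0; j < len / 2; ++j) {
                const cd u = a[i + j], v = a[i + j + len / 2] * w;
                a[i + j] = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;                            // twiddle by recurrence
            }
        }
    }
    if (invert)
        for (cd& x : a) x /= static_cast<double>(n);
}

// Multiply two numbers given as base-10^4 digits, least significant first.
std::vector<uint32_t> fft_multiply(const std::vector<uint32_t>& x,
                                   const std::vector<uint32_t>& y) {
    std::size_t n = 1;
    while (n < x.size() + y.size()) n <<= 1;
    std::vector<cd> fa(x.begin(), x.end()), fb(y.begin(), y.end());
    fa.resize(n);
    fb.resize(n);
    fft(fa, false);
    fft(fb, false);
    for (std::size_t i = 0; i < n; ++i) fa[i] *= fb[i];  // pointwise product
    fft(fa, true);                                       // inverse transform
    std::vector<uint32_t> r(n);
    uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {                // carry propagation
        const uint64_t t =
            static_cast<uint64_t>(std::llround(fa[i].real())) + carry;
        r[i] = static_cast<uint32_t>(t % 10000);
        carry = t / 10000;
    }
    return r;  // leading zero digits (at the back) may remain
}
```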

There is a program called y-cruncher, which is used to calculate Pi and other constants and which can multiply terabyte-sized numbers. It uses an algorithm called Hybrid NTT and another algorithm called VST (see A Peak into y-cruncher v0.6.1, section The VST Multiplication Algorithm).

Can anyone shed some light on these algorithms, or on any other algorithm that can be used to multiply terabyte-sized numbers?

Jonathan Leffler
iblue
  • I voted this as "too broad" because a good answer to "how does out-of-core integer multiplication work?" will probably not be short. Knowing Mysticial is an active SO contributor, however, I'm quite happy to retract my close vote if someone does happen to give a good answer. – tmyklebu Mar 01 '15 at 21:20
  • Do you want to know how to perform 1234567890^2? I don't know what your question is. – Don Larynx Mar 01 '15 at 21:23
  • Remember in grade school when you multiplied columns, carried, and then added? Same thing, but you don't have to do it for every power of 10, only for every 2^n (where n is the number of bits in your largest integer type); see the schoolbook sketch after this comment thread. – technosaurus Mar 01 '15 at 21:25
  • Exactly. Except the number is not 1234567890, but has 10^12 places. And I don't want to square it, but multiply two arbitrary numbers. And I want it to be fast. The grade school method has a runtime of O(n^2). – iblue Mar 01 '15 at 21:26
  • To be clear, do you want to multiply 2 numbers of size `m` and `n` and compute the exact `m+n` long product or an approximation? – chux - Reinstate Monica Mar 02 '15 at 00:38
  • If you really intend to manipulate such numbers, either your access to them must be entirely sequential left-to-right or vice versa (might be able to handle them as streams), or you need random access to the digits. In the latter case you will obviously need terabyte-scale RAM and I doubt you can realistically afford that. What is the nature of the application that needs this? – Ira Baxter Mar 02 '15 at 00:58
  • Yes, @chux, I want to compute the exact n+m (or n+m-1) sized product of two n,m-sized numbers with n,m ~ 1 TB. – iblue Mar 02 '15 at 02:25
  • Most of the answer is to linearize access to the inputs, twiddle factors, etc., so you're basically reading in a chunk, doing some processing, writing out that chunk, reading in the next chunk, and repeating as needed. That said, it is very disk intensive--some runs of Y-cruncher have destroyed multiple hard drives, and it would almost certainly destroy an SSD in short order as well. It supports (and nearly needs) multiple disk drives to improve bandwidth. – Jerry Coffin Mar 02 '15 at 06:19
  • Jerry is spot on. It's all about the implementation and not the algorithm. If the problem you're running into is the twiddle factors, it helps to remember that caching twiddle factors is a space-time tradeoff. You don't have to be at either extreme of the tradeoff. If you get to the point where you need to swap them out to disk, it might be worth it to consider generating them on the fly instead (see the twiddle sketch after this comment thread). So you cache some, but not all the twiddles. Also, the Schönhage–Strassen algorithm doesn't need twiddle factors, since they degenerate to shifts. – Mysticial Mar 03 '15 at 19:29
  • What do you mean by "the twiddle factors degenerate to shifts"? I'm using a floating-point FFT. Does this apply to the NTT? – iblue Mar 03 '15 at 19:35
  • Specifically the Schönhage–Strassen algorithm, which is a special case of the NTT. It's the only known algorithm that has trivial twiddle factors. (The twiddle factors are all powers of two, so they don't need to be cached; see the shift sketch after this comment thread.) – Mysticial Mar 03 '15 at 19:42
  • Alexander Yee hasn't yet published the details of the VST algorithm. A paper of his implied that for FFTs he uses a 3- or 5-pass out-of-core algorithm that is an optimization of the 4-pass one detailed in Bailey's "FFTs in external or hierarchical memory", but I can't find published details on the changes Yee made to it. – Katie Mar 06 '15 at 19:57
  • I think this question is largely dependent on your hardware. I've worked on clusters where you could fit TB-sized numbers entirely into active memory. In that case, the problem is mostly about parallelisation. If you want to do this kind of thing on your home computer, it's a very different problem. – Matt Ko Mar 10 '15 at 14:59
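To make technosaurus's and iblue's exchange concrete, here is a minimal sketch of the grade-school method over base-2^32 limbs (the limb layout is an assumption for illustration). It produces the exact m+n-limb product, but costs O(n·m) limb multiplications; at 10^12 decimal places that is roughly 10^22 limb products, which is why subquadratic algorithms are mandatory at this scale.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Schoolbook multiplication: limbs are base-2^32, least significant first.
std::vector<uint32_t> schoolbook_mul(const std::vector<uint32_t>& a,
                                     const std::vector<uint32_t>& b) {
    std::vector<uint32_t> r(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            // a[i]*b[j] + r[i+j] + carry fits in 64 bits: it is at most 2^64 - 1.
            const uint64_t t =
                static_cast<uint64_t>(a[i]) * b[j] + r[i + j] + carry;
            r[i + j] = static_cast<uint32_t>(t);  // low 32 bits stay in place
            carry = t >> 32;                      // high 32 bits carry onward
        }
        r[i + b.size()] = static_cast<uint32_t>(carry);
    }
    return r;  // exact product, a.size() + b.size() limbs
}
```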
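And a minimal sketch of the middle of the space-time tradeoff Mysticial describes: rather than caching all N twiddles (16 bytes each, so 16 GiB at N = 2^30), generate a chunk on demand with one complex multiply per twiddle, re-seeding periodically from std::polar to bound floating-point drift. The chunk interface and the re-seed interval of 256 are illustrative assumptions, not recommendations.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Fill `out` with twiddles k0 .. k0+out.size()-1 of an n-point FFT, i.e.
// exp(-2*pi*i*k/n), computed on demand instead of read from a cached table.
void twiddle_chunk(std::vector<std::complex<double>>& out,
                   std::size_t k0, std::size_t n) {
    const double pi = std::acos(-1.0);
    const std::complex<double> step =
        std::polar(1.0, -2.0 * pi / static_cast<double>(n));
    std::complex<double> w =
        std::polar(1.0, -2.0 * pi * static_cast<double>(k0) / n);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (i % 256 == 0)  // re-seed to cancel accumulated rounding error
            w = std::polar(1.0, -2.0 * pi * static_cast<double>(k0 + i) / n);
        out[i] = w;
        w *= step;         // advance by one complex multiply
    }
}
```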
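Finally, what "the twiddle factors degenerate to shifts" means: Schönhage–Strassen works in the ring Z/(2^n + 1), where 2 is a root of unity, so every twiddle is a power of two and multiplying by one is a shift plus a wrap-around correction. This is a property of the NTT over that ring and does not carry over to the floating-point FFT the question uses. The word-sized sketch below is an illustration; real implementations use multi-limb residues.

```cpp
#include <cassert>
#include <cstdint>

// x * 2^e mod (2^n + 1), for x <= 2^n, e < n, and n + e < 64 so the shift
// fits in a uint64_t. Since 2^n = -1 (mod 2^n + 1), the part shifted past
// bit n wraps around with a sign flip: x*2^e = hi*2^n + lo = lo - hi.
uint64_t mul_pow2_mod(uint64_t x, unsigned e, unsigned n) {
    const uint64_t mask = (uint64_t(1) << n) - 1;
    const uint64_t lo = (x << e) & mask;   // bits that stay below 2^n
    const uint64_t hi = (x << e) >> n;     // bits that wrapped around
    return lo >= hi ? lo - hi : lo + (uint64_t(1) << n) + 1 - hi;
}

int main() {
    // Sanity check in Z/(2^16 + 1): 12345 * 2^5 mod 65537 == 1818.
    assert(mul_pow2_mod(12345, 5, 16) == (12345ull * 32) % 65537);
    return 0;
}
```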

1 Answer


The FFT can be done in place on the same array with a constant amount of additional memory (the elements may need to be swapped around smartly). Therefore it can be done on the hard disk as well. In the worst case that's on the order of N·log(N) disk accesses. It seems a lot slower than doing it in RAM, but the overall complexity remains the same.
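To make this concrete, below is a hedged sketch of one butterfly pass of the iterative (decimation-in-time) FFT performed directly on a file of complex<double>, assuming the data was bit-reversal permuted up front (the "smart rearrangement" mentioned in the comments below). Running it for len = 2, 4, ..., N yields the full transform in log2(N) sequential sweeps of the file, matching Jerry Coffin's read-a-chunk, process, write-a-chunk description. The file layout and block size are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <fstream>
#include <vector>

using cd = std::complex<double>;

// One butterfly pass of length `len` over a file holding n values of cd
// (n, len, block powers of two, len <= n). Each pass streams the file once
// in `block`-sized chunks, so RAM usage is O(block) regardless of n.
void butterfly_pass_on_disk(const char* path, std::size_t n,
                            std::size_t len, std::size_t block) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    const std::size_t half = len / 2;
    const double ang = -2.0 * std::acos(-1.0) / static_cast<double>(len);
    std::vector<cd> lo(block), hi(block);
    for (std::size_t i = 0; i < n; i += len) {
        for (std::size_t j = 0; j < half; j += block) {
            const std::size_t m = std::min(block, half - j);
            f.seekg(static_cast<std::streamoff>((i + j) * sizeof(cd)));
            f.read(reinterpret_cast<char*>(lo.data()), m * sizeof(cd));
            f.seekg(static_cast<std::streamoff>((i + j + half) * sizeof(cd)));
            f.read(reinterpret_cast<char*>(hi.data()), m * sizeof(cd));
            for (std::size_t k = 0; k < m; ++k) {  // the butterflies
                const cd w = std::polar(1.0, ang * static_cast<double>(j + k));
                const cd u = lo[k], v = hi[k] * w; // twiddle on the fly
                lo[k] = u + v;
                hi[k] = u - v;
            }
            f.seekp(static_cast<std::streamoff>((i + j) * sizeof(cd)));
            f.write(reinterpret_cast<const char*>(lo.data()), m * sizeof(cd));
            f.seekp(static_cast<std::streamoff>((i + j + half) * sizeof(cd)));
            f.write(reinterpret_cast<const char*>(hi.data()), m * sizeof(cd));
        }
    }
}
```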

DU Jiaen
  • The complexity may remain the same, but [the actual read times are much, much slower](https://gist.github.com/jboner/2841832). Don't forget that those constant factors matter in practice. – wchargin Mar 04 '15 at 04:45
  • It's only log(N) sequential passes over the whole array on the hard disk, which should be much faster than random access, if you implement the FFT in a smart way (rearrange the numbers at the beginning). – DU Jiaen Mar 04 '15 at 05:57
  • For large numbers (> 4 GB) I can fit 10 bits or less via FFT into one complex data point (2 doubles = 128 bits), so I have to write 12.8 times the size of the actual number if I do an FFT on it. This wouldn't be very efficient. – iblue Mar 09 '15 at 09:22