
I am designing an operating system, written in NASM. The purpose is to run Fermat's primality test on a low-level system in protected mode without multitasking, using DOS-level system calls via DPMI. The code itself is fairly standard and I don't think it is a good idea to blow up this question with long listings. The question is about Python's pow(base, exponent, modulus) function.

I will be calculating very long numbers, on the order of 10^100000000 (example: 2**332200000 - 1). The code gets the input from the user or reads it from a file. It takes approximately 40 MB to cache an integer of this size in a file or in memory, so I simply allocate a 40 MB buffer in protected mode. Fermat's little theorem works as you know:


if p is prime, a is an integer and gcd(a, p) = 1, then

a**(p-1) mod p = 1
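
For example, with p = 7 and a = 2 (gcd(2, 7) = 1): 2**6 mod 7 = 64 mod 7 = 1.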


In Python this is calculated like a charm: with no extra effort it gives the result back. But extraordinarily large integers like 2^332200000 - 1 are slow, so I decided to make my own operating system / shell that starts when my computer boots. The reason is to get the most out of my computer system, without any system calls slowing down my calculations. I have the following questions:

  • Is there a website where I can view and study the assembly code behind Python's power function?

  • Either way, can you give me a hint on how to do this effectively in assembly, as short and fast as possible?

The idea is very basic and brief:

A single 4-byte integer will not work in assembly, so I decided to read the long hex integers from a file into allocated memory (40 MB). When I do calculations with this very long integer, for example multiplying by 2, I roll each 4-byte integer over into a second, free memory area. If there is a carry left over, it is added into the next 4-byte calculation, and so on. This way it is possible to keep these long integers in memory. Everything is already designed, but the assembly implementation is still in the research phase; a sketch of the carry handling I have in mind is shown below. Can you help or point me in the right direction?
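
Here is a minimal NASM sketch of this carry propagation, assuming the big number is stored as an array of 32-bit limbs with the least-significant limb first. The routine names, register choices and calling convention are only illustrative assumptions, not taken from my actual code:

```nasm
; Illustrative 32-bit protected-mode routines (names/registers are assumptions).
; Big-integer layout: array of 32-bit limbs, least-significant limb first.

; Multiply the number by 2 in place: shift each limb left by one bit,
; carrying the top bit of each limb into the bottom bit of the next one.
; in: esi -> least-significant limb, ecx = limb count (> 0)
bigint_shl1:
    clc                      ; no carry into the lowest limb
.limb:
    rcl dword [esi], 1       ; shift limb left through the carry flag
    lea esi, [esi + 4]       ; advance pointer; lea leaves CF untouched
    loop .limb               ; dec ecx / jnz; also leaves CF untouched
    ret

; dst += src for two numbers of equal length, propagating the carry.
; in: edi -> dst limbs, esi -> src limbs, ecx = limb count (> 0)
bigint_add:
    clc
    xor edx, edx             ; limb index
.limb:
    mov eax, [esi + edx*4]   ; load source limb
    adc [edi + edx*4], eax   ; add with carry into destination limb
    inc edx                  ; inc does not modify CF
    loop .limb
    ret
```

The final carry out of the top limb is left in CF, so the caller can grow the number by one limb when it is set. The same adc pattern extends to 64-bit limbs in long mode, as Peter Cordes notes in the comments.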

Again: how do I calculate with very, very long numbers in assembly, and how do I build a Python-like power function that takes an exponent and a modulus? What would that look like in code form?
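
As far as I understand, Python's three-argument pow is, at its core, modular exponentiation by repeated squaring (square-and-multiply). Below is a minimal 32-bit sketch of that control flow, assuming for illustration that base, exponent and modulus each fit in one register; for my 40 MB numbers every mul/div step would become a call to a multi-limb multiply-and-reduce routine. The routine name and register convention are my own assumptions:

```nasm
; result = base^exp mod m, all 32-bit, assuming m > 1 (illustrative only).
; in:  eax = base, ecx = exponent, ebx = modulus
; out: eax = result      clobbers: edx, esi
modpow32:
    xor edx, edx
    div ebx                  ; edx = base mod m
    mov esi, edx             ; esi = running square
    mov eax, 1               ; eax = running result
.bit:
    test ecx, 1              ; is the lowest exponent bit set?
    jz .square
    mul esi                  ; edx:eax = result * square
    div ebx                  ; edx = (result * square) mod m
    mov eax, edx
.square:
    push eax                 ; keep the running result
    mov eax, esi
    mul esi                  ; edx:eax = square * square
    div ebx
    mov esi, edx             ; square = square^2 mod m
    pop eax
    shr ecx, 1               ; next exponent bit
    jnz .bit                 ; loop until the exponent is exhausted
    ret
```

The loop runs once per exponent bit, so for an exponent near 2**332200000 almost all of the time goes into roughly 332 million big-number squarings; the multi-limb multiplication and reduction routines are the part worth optimizing.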

  • This question points out [Python C implementation of pow](https://stackoverflow.com/questions/5246856/how-did-python-implement-the-built-in-function-pow). In particular, the first answer by Sven Marnach. The assembly would be from the compiled C code for your machine. – DarrylG Mar 18 '21 at 17:27
  • Note that extended-precision integer math is usually about twice as fast with 64-bit registers (twice as much work per instruction), or even 4x faster for multiply because 64x64 => 128-bits `mul r64` is 4x as much work as a 32x32 => 64-bit `mul r32`. (e.g. it takes 4 `mul r32` instructions and multiple adds to do a 64x64 => 128-bit multiply). So you'd probably want to use long mode, not protected. – Peter Cordes Mar 18 '21 at 20:06
  • Also *without multitasking* - avoiding context switches is only going to gain you a small bit of performance vs. user-space under a mainstream OS like GNU/Linux, at the cost of giving up all the perf of your other CPU cores. Modern x86 CPUs are all multi-core. You can write your computation in assembly language and call it from C, unless you really want to both write a bootloader *and* work on optimizing the computation. Running under an OS lets you easily use tools like Linux `perf stat` to profile it with hardware performance counters, as well as giving you a filesystem and I/O easily. – Peter Cordes Mar 18 '21 at 20:09
  • For existing asm for extended-precision, see https://gmplib.org/repo/gmp/file/tip/mpn/x86_64. For example, left-shift (using SSE2): https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/fastsse/lshift.asm, or multiply tuned for Sandybridge (https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/mul_basecase.asm). (See also the mulx/adx version if you have a Broadwell or newer.) – Peter Cordes Mar 18 '21 at 20:17
  • For asm optimization, see https://www.agner.org/optimize/, and other links in https://stackoverflow.com/tags/x86/info – Peter Cordes Mar 18 '21 at 20:18
  • Also note that Python's extended-precision storage format is a compromise to let it use portable C, without access to a hardware carry flag. IIRC, it uses values up to 2^30 in 32-bit chunks, leaving 2 bits per int wasted, and has to normalize results back into that format. See https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for examples of 512x512-bit multiply (with plain mul, and with mulx + adox) – Peter Cordes Mar 18 '21 at 20:57
  • But see also [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) - spare bits can let you delay normalization, actually making speedups from SSE/AVX/FMA possible. (Yes, using FP FMA to get exact integer math done on mantissas.) But you have to design your whole approach around taking advantage of this, and it's not simple. Start by understanding the normal add/adc way first. – Peter Cordes Mar 18 '21 at 21:02

0 Answers