
I have some questions about saving computing power on calculations and CPU architecture, because I want to create a 2D game with friends that can handle as many objects and parameters as possible, to learn about efficient computing. We're using pygame for now but can implement C and FASM functions.

So:

1. Can you save processing power by reducing the precision of a division, i.e. the number of binary places in the quotient?

This precision would have to be in binary fractional places, of course, since 0.1 in decimal has a repeating fractional part in binary, for example; so let's say you would set it to 0.125, which is 0.001 = 3 fractional places in binary. I mean, you could just use long division to get to a certain fractional place, but I guess that's not more efficient, since the processor will have to load every intermediate result into the register again. I couldn't figure this one out, since x86 processors have their own DIV instruction in 4 different versions and I don't know how the CPU executes those. Is it possible to write an assembly function that takes a precision parameter, to gain efficiency this way? Could it be useful to just use half-precision floats or other data types if no accuracy above 2048 is needed?
On the other hand, is it useful to reduce the binary places of a constant like PI that could be used thousands of times a frame, since I don't need inch precision out past Jupiter's orbit, which is what the standard 15 decimal places give you?

2. Reducing trigonometric function precision, or using a lookup table

I read that the CORDIC algorithm works bit by bit, so would a version where you can specify the number of binary places to calculate be more efficient? I don't know which algorithm Python's math module uses by default, or whether something like this is available in NumPy/Anaconda. Alternatively, would it be even faster to precalculate a table of sine results to the needed precision, or to interpolate between those results when more precision is needed?
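To illustrate the table idea, here's a rough sketch; the table size and function names are arbitrary choices of mine, not anything from a standard library:

    import numpy as np

    TABLE_SIZE = 4096  # arbitrary; a bigger table means a smaller error
    SIN_TABLE = np.sin(np.linspace(0.0, 2.0 * np.pi, TABLE_SIZE, endpoint=False))

    def table_sin(theta):
        """Nearest-lower-entry lookup; error on the order of 2*pi/TABLE_SIZE."""
        idx = int(theta * (TABLE_SIZE / (2.0 * np.pi))) % TABLE_SIZE
        return SIN_TABLE[idx]

    def lerp_sin(theta):
        """Linearly interpolate between adjacent entries for more precision."""
        pos = (theta * (TABLE_SIZE / (2.0 * np.pi))) % TABLE_SIZE
        i = int(pos)
        frac = pos - i
        return (1.0 - frac) * SIN_TABLE[i] + frac * SIN_TABLE[(i + 1) % TABLE_SIZE]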

3. What is the most efficient collision algorithm you know?

I had the idea that instead of checking objects against objects, which needs a number of checks quadratic in the number of objects, you could write the positions of objects into a matrix (NumPy array) with a resolution that you set; for 2D it could be pixel-perfect, so the size just depends on how big your play area is, e.g. a 1920x1080 matrix for one screen. Each cell in the matrix then holds a number that references an object in an object pool. While writing your new position when moving, you just check whether something is already there.

This also has the advantage that bitmap collision comes at zero extra cost compared to rectangle collision, because you can write the bitmap into the matrix directly. I could imagine this runs more efficiently at a very large number of objects. I got a basic version of this working, but it would need more optimization. I know you can also do spatial subdivision for object-against-object checks, but I don't know which yields the better result.
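To illustrate, a stripped-down sketch of the idea for 1-pixel objects (all names are placeholders; object ids start at 1 so that 0 can mean an empty cell):

    import numpy as np

    W, H = 1920, 1080
    grid = np.zeros((W, H), dtype=np.int32)  # 0 = empty, else object-pool index

    def try_move(obj_id, old_x, old_y, new_x, new_y):
        """Move a 1-pixel object; return the id it bumped into, or 0 on success."""
        occupant = grid[new_x, new_y]
        if occupant != 0 and occupant != obj_id:
            return occupant              # collision: look the other object up in the pool
        grid[old_x, old_y] = 0           # vacate the old cell
        grid[new_x, new_y] = obj_id      # claim the new one
        return 0

A bitmap-shaped object would write and clear its whole mask of cells instead of a single cell.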

Lastly, do you have any book recommendations that help with understanding how the processor does its operations and uses the cache/memory, to help with this kind of stuff?

I hope someone can do something with this brick of questions, cheers!

NooseN
This should really be separate questions; especially part 3, about algorithms, is basically separate from the part about fast approximations. My current answer only answers part 1, although I'll add a bit about part 2. – Peter Cordes Feb 22 '20 at 18:35

1 Answer


Part 1 - FP math precision vs. speed.

Could it be useful to just use half-precision floats or other data types if no accuracy above 2048 is needed?

Single-precision float has faster divide / sqrt than double, and if your code auto-vectorizes (e.g. in C with an ahead-of-time compiler), it fits twice as many elements per SIMD vector = twice as much work per unit of execution cost.

Also, half the size means half the cache footprint, and half the memory bandwidth when you do miss in cache, compared to double.

Actual half-precision float (16-bit) doesn't have much HW support on CPUs (see Half-precision floating-point arithmetic on Intel chips), and is probably too inaccurate for a lot of stuff in a game.
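For example, in NumPy (which you're already using), just picking a smaller dtype halves the footprint; the array length here is arbitrary:

    import numpy as np

    n = 1_000_000  # arbitrary
    print(np.zeros(n, dtype=np.float64).nbytes)  # 8000000 bytes
    print(np.zeros(n, dtype=np.float32).nbytes)  # 4000000: half the cache/memory traffic
    print(np.zeros(n, dtype=np.float16).nbytes)  # 2000000, but float16 math on most CPUs
                                                 # means converting to float32 and back,
                                                 # and its 11-bit mantissa only represents
                                                 # integers exactly up to 2048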


The x86 div instruction only does integer division. You're talking about decimal fractions, but computers natively handle integers and binary fractions (floating point). 0.125 is 1 * 2^-3, so it's actually a very "simple" floating-point number (mantissa = just the implicit 1).

Most asm operations are the same speed regardless of the data, but division / sqrt is an exception (Floating point division vs floating point multiplication). Of course, dividing by 0.125 is a lot slower than multiplying by 8, so do that instead! E.g. compute mult = 1.0 / divisor before a loop. If you were writing in asm, you could even do that reciprocal with rcpps to get a 12-bit-precision approximation faster than divps. But really you don't need to; hardware FP division is not that slow, especially when you're going to reuse the reciprocal lots of times.
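In NumPy terms, hoisting the reciprocal out of the hot path looks like this (array contents arbitrary):

    import numpy as np

    positions = np.random.rand(10_000).astype(np.float32)
    divisor = np.float32(0.125)

    # slow: one division per element
    scaled = positions / divisor

    # faster: one division total, then multiplies
    inv = np.float32(1.0) / divisor   # 1/0.125 == 8.0 exactly (power of 2), so results
    scaled = positions * inv          # are bit-identical here; for other divisors the
                                      # reciprocal is rounded, costing ~1 ulp of accuracy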


Of course, Python interpreter overhead dwarfs everything else; see for example Why are bitwise operators slower than multiplication/division/modulo?


In legacy 32-bit code that used the x87 FPU instead of SSE scalar floating point (x86-64 uses SSE for plain scalar FP math), you could set the FPU's internal rounding precision, which would speed up div/sqrt somewhat: switching fpu to single precision.

(Semi-related: Is x87 FP stack still relevant?)

Lastly, do you have any book recommendations that help with understanding how the processor does its operations and uses the cache/memory, to help with this kind of stuff?

Read Agner Fog's optimization guides. He has C++ and asm optimization guides, and a microarch guide for the real nitty-gritty of exactly how CPUs work internally, which you'd read if you were a compiler developer or tuning asm by hand.

See also What Every Programmer Should Know About Memory? for cache / memory performance.


Part 2 - fast approximations

Often you can avoid trig by storing angles as a unit vector of [x, y] components, or [x, y, z] in 3D. This is widely used in games, and lets you rotate by multiplying by a rotation matrix.
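In 2D, for instance, one rotation step is just the angle-addition formulas; a sketch, with an arbitrary turn rate:

    import math

    c, s = 1.0, 0.0                          # facing +x, stored as (cos, sin)
    step = math.radians(2.0)                 # per-frame turn rate
    rc, rs = math.cos(step), math.sin(step)  # trig done once, not every frame

    def rotate(c, s, rc, rs):
        """Angle-addition: 4 multiplies + 2 adds per rotation, no sin/cos calls."""
        return c * rc - s * rs, s * rc + c * rs

    for _ in range(45):                      # 45 steps of 2 degrees = 90 degrees
        c, s = rotate(c, s, rc, rs)
    print(c, s)                              # ~(0.0, 1.0)

Rounding error slowly makes the vector drift off unit length, so renormalize occasionally (divide both components by math.hypot(c, s)).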

But when you do need a math library function like trig, or log / exp, you can sometimes use a fast approximation. Usually this only makes sense if you're manually vectorizing with SIMD or writing in assembly language. Or possibly pure C / C++, if you can get the compiler to not make a mess of doing integer stuff to an FP bit-pattern. Implementing a multi-step algorithm in pure Python will be slower than just calling the standard math library function.

See also SIMD math libraries for SSE and AVX, and Where is Clang's '_mm256_pow_ps' intrinsic?, for SIMD math libraries where you can find vectorized implementations of various functions, some with different speed / precision tradeoffs.

If you're writing in asm, x87 has instructions like fsin but they're implemented with microcode that doesn't have a better speed / precision tradeoff than what you can do with single-uop "normal" instructions, e.g. with SSE2 scalar math.

Before you worry about writing in asm, I'd recommend optimizing to do multiple calculations at once with SIMD. (Using C with intrinsics, or NumPy.) See https://stackoverflow.com/tags/sse/info for some links, especially these slides: SIMD at Insomniac Games (GDC 2015), for more about how to choose your data layouts so SIMD can work for you. (Avoid using one SIMD vector to store one xy or xyz vector; instead you want SIMD vectors of 4 x components, 4 y components, etc.)
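In NumPy, that structure-of-arrays layout just means keeping a separate array per component, so each update is one contiguous vectorized pass (names and sizes are mine):

    import numpy as np

    n = 100_000
    xs = np.random.rand(n).astype(np.float32)   # all x components together
    ys = np.random.rand(n).astype(np.float32)   # all y components together
    vxs = np.zeros(n, dtype=np.float32)
    vys = np.zeros(n, dtype=np.float32)

    dt = np.float32(1.0 / 60.0)
    xs += vxs * dt    # one SIMD-friendly pass per component,
    ys += vys * dt    # instead of a Python loop over (x, y) pairs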

Part 3 - this should be asked as a separate question; I'm not going to answer it.

Peter Cordes