What's an efficient implementation of Conway's Game of Life for low memory uses?

Question

I'm looking for a fast and memory efficient approach for implementing Conway's Game of Life.

Constraints: a 96x128 board, approximately 2kB RAM available and 52MHz processor (see the tech specs here: http://www.getinpulse.com/features).

My current naive solution that represents each cell as a single bit in a matrix (96*128/8=1,536 bytes) works but is too slow. What tricks can be used to improve performance?

Storing the coordinates of live cells (for example in this implementation http://dotat.at/prog/life/life.html) would use too much memory.

Is it possible to read data from the display memory? If so, you might be better off accessing the state directly from the display buffer rather than replicating it all in memory. I notice that the watch itself has 8kB of RAM - is there a reason you're limited to a quarter of that? — Anon., Feb 20 '11 at 20:41
It would seem that the best solution would be not to burn away the rest of the RAM memory on god knows what. According to that link you have 8kb of RAM. Free up as much of the other 6kb as possible... you'd need quite a monster app to use that much, even if you have a big fat stack on lets say 1kb. Get rid of floating point and dynamic memory for starters. — Lundin, Feb 21 '11 at 14:30

score 9 · Accepted Answer · answered Feb 21 '11 at 06:40

Looks like a fun piece of hardware. Storing 1 bit per pixel of a 96x128 pixel display gives 12,288 bits. That's already over half of the 16,384 bits you say are "available". If you can't even store 2 bits per cell, there's not a lot of room to do much.

A few ideas:

You have a 32-bit processor. There are several bit-twiddling tricks to take a bitmap and calculate the number of neighbors of several cells in parallel on such a processor.
It's often faster to store the neighbor counts, incrementing all 8 neighbors at birth and decrementing all 8 neighbors at death, rather than recalculating the number of neighbors from scratch every generation -- but it doesn't look like you have enough memory for this approach.
Perhaps you could use 2x2 pixels per cell (so only 3,072 cells visible) or 3x3 pixels per cell (so only 1,376 cells visible), so your software does less work and so gives the illusion of running faster. (This also frees up enough RAM that you can do the neighbor count mentioned above).
Since many life patterns have large areas of dead space, perhaps have a small bitmap of "live regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region is entirely dead, and "1" if any of the cells in the corresponding region is alive. The next-generation update only needs to look at live regions and the 1-cell-wide border of dead regions; you can skip checking the 6x6 cell core of dead regions. Also, you can completely skip the entire 8x8 cell region if its 4 nearest neighboring regions are also dead.
Since many life patterns have large areas of static unchanging "still life" patterns, perhaps have a small bitmap of "dynamic regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region had no changes in the last generation update, and "1" if any of the cells in the corresponding region changed in the last update. The next generation update only needs to look at dynamic regions and the 1-cell-wide border of static regions; you can skip checking the 6x6 cell core of static regions, since it will be the same in the next generation.
If a pattern is "large enough", a representation based on Gosper's Hashlife may be able to store it in less RAM than storing a bitmap directly. Alas, I suspect you're far below the "large enough" threshold.

score 2 · Answer 2 · answered Feb 21 '11 at 12:20

It is quite common with small life universes to make them wrap round on all sides (to make a toroidal universe) but this requires double buffering. In your case this requires 3KB RAM but you only have 2KB.

If you don't wrap then you don't need to double-buffer the whole universe. You just need to avoid overwriting cells before you have finished using them as input to the calculation.

Say your universe is laid out as a conventional bitmap. We are going to treat it as a series of rows arranges sequentially in memory. Say the universe has four rows numbered 0 to 3.

When you calculate the next generation, the new version of row 3 is calculated using rows 2, 3, and 4 (which is blank). You can write the new version of row 3 on top of row 4. Similarly, calculate the new row 2 from rows 1,2,3 and write it on top of row 3, since that data is no longer needed after calculating row 2. The new row 1 is calculated from rows 0,1,2 and written over row 2.

This means you only need memory for one extra row. 97*128 bits is 1552 bytes.

The disadvantage is that your universe scrolls through memory and you have to have some mechanism to deal with this. Either copy it back to its original position after each calculation (which is slow) or arrange to be able to display it from any position in memory, and ensure that your calculation and display systems can wrap around from high to low addresses.

Use of a toroidal universe calculated top-to-bottom will make it necessary to keep a copy of the top row until the bottom row is calculated, but it's only necessary to double-buffer that row and one or two of the most-recently-calculated rows--not the entire universe. — supercat, Feb 18 '15 at 00:24

score 2 · Answer 3 · edited Jun 09 '19 at 08:39

2

Look at the chapter on the Game of Life in "The Zen of Code Optimization" by Michael Abrash. There was an implementation that coded the current and previous states of 3 cells into a single word, and used tricks with lookup tables and carry bits generate the next states. It was blazingly fast.

edited Jun 09 '19 at 08:39

Cœur

32,421
21
173
232

answered Feb 21 '11 at 20:12

AShelly

32,752
12
84
142

score 1 · Answer 4 · answered Feb 23 '11 at 00:13

I would suggest starting with a routine to read a row of the board and generate two new row buffers, H and L such that bit 'n' of H will be set if two or more of (bin n+1, bit n, bit n-1) were set in the original row, and bit 'n' of L will be set if an odd number of (bin n+1, bit n, bit n-1) were set in the original row.

Allocate a total of three pairs of row buffers (call them H0..H2 and L0..L2). Take each row of the source file and compute a pair of buffers and store it in an HL pair, keeping it and the previous two. Examining a word from all six buffers will reveal which of 32 cells in the original matrix should be live if and only if they were previously, and which should be alive regardless of previous state.

For optimal performance in machine code, the three pairs of buffers may be interleaved; this may make it possible to achieve an execution speed of less than two cycles per pixel. Perhaps overkill at 52MHz (with only 12288 pixels, that would be a frame rate of ~4000fps) but speed is cool.

score 0 · Answer 5 · answered Feb 21 '11 at 06:41

0

If you have license to abuse the rest of the 30KB on the device (a.k.a. flash memory), you could always store it there, not ideal, but it is a way to potentially work around the limited RAM.

Efficiency speaking you're going to be trading CPU and memory performance for one another:

For memory efficiency, an array of bits is likely the optimal solution, but you lose CPU efficiency by iterating through that grid.

Depending on the capabilities and the size of memory addresses, you could use a linked list of cells to avoid iterating through the entire grid: would definitely save you time scanning through huge regions of dead cells.

answered Feb 21 '11 at 06:41

David Elner

4,781
5
30
46

I think in this case there actually isn't a space time trade off. Storing the cell states as individual bits allows you to use the inherent parallelism of the processors bit wise integer operations whereas storing each cell state with its own integer primitive carries the sunk cost of the computation on the higher order bits. That said I haven't implemented the technique yet so it's possible there's overhead that mitigates the performance gain. Another potential performance benefit of storing the cell states in individual bits is improved cache efficiency. – M.J. Rayburn Feb 07 '19 at 23:36

Suncat2000 · Answer 6 · 2011-02-22T02:03:11.193

Stick with the bit array and skip the neighbor counts with this trick: create a lookup table with the possible bit patterns of neighboring cell blocks that create or maintain a live cell.

The "index" pattern can be packed into 8 bits if you want to minimize memory: 3 bits from the row above, two bits from the columns to either side, and 3 bits from the row below. You can encode the output as a single bit in a lookup table, taking up a total of only 256 bits. Use the index as a bit-shift count for your lookup array to "calculate" the result. Bit shifts and arithmetic OR operations are still very fast, and this technique eliminates counting neighboring live cells on the fly - or rather, the lookup table pre-processes the count and calculation.

The top processing bottlenecks should be: checking for board edge conditions, i.e. end of a row; word boundaries; extracting and packing neighbor bits as an index value. With a 32-bit processor, you should be able to cycle through 30 cells very quickly before hitting a word boundary. Addressing cell rows could simply be a matter of adding the number of columns/32 if you pack the bits into word-sized ints. Cache the results into two spare new-life rows and copy a whole row when you're done processing one.

There may be some pattern symmetries you can take advantage of to optimize things even more, but this may be enough. I think this technique will keep the processing and memory usage within your limitations.

What's an efficient implementation of Conway's Game of Life for low memory uses?

6 Answers6

Linked