
Suppose that I want to create a compile-time constructed bit-count lookup table for 64-bit integers in 16-bit chunks. The only way I know to do this is the following code:

#define B4(n) n, n + 1, n + 1, n + 2
#define B6(n)   B4(n),   B4(n + 1),   B4(n + 1),  B4(n + 2)  
#define B8(n)   B6(n),   B6(n + 1),   B6(n + 1),  B6(n + 2)
#define B10(n)  B8(n),   B8(n + 1),   B8(n + 1),  B8(n + 2)
#define B12(n)  B10(n),  B10(n + 1),  B10(n + 1), B10(n + 2)
#define B14(n)  B12(n),  B12(n + 1),  B12(n + 1), B12(n + 2)
#define B16(n)  B14(n),  B14(n + 1),  B14(n + 1), B14(n + 2)
#define COUNT_BITS B16(0), B16(1), B16(1), B16(2)

unsigned int lookup[65536] = { COUNT_BITS };
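
For context, the table is used by splitting a 64-bit value into four 16-bit chunks and summing the four lookups; a sketch of that usage (the `popcount64` wrapper name is my own, not part of the question):

```cpp
#include <cstdint>

#define B4(n)   n, n + 1, n + 1, n + 2
#define B6(n)   B4(n),   B4(n + 1),   B4(n + 1),  B4(n + 2)
#define B8(n)   B6(n),   B6(n + 1),   B6(n + 1),  B6(n + 2)
#define B10(n)  B8(n),   B8(n + 1),   B8(n + 1),  B8(n + 2)
#define B12(n)  B10(n),  B10(n + 1),  B10(n + 1), B10(n + 2)
#define B14(n)  B12(n),  B12(n + 1),  B12(n + 1), B12(n + 2)
#define B16(n)  B14(n),  B14(n + 1),  B14(n + 1), B14(n + 2)
#define COUNT_BITS B16(0), B16(1), B16(1), B16(2)

unsigned int lookup[65536] = { COUNT_BITS };

// sum the table entries for each of the four 16-bit chunks
unsigned int popcount64(std::uint64_t v) {
    return lookup[v & 0xFFFF] + lookup[(v >> 16) & 0xFFFF]
         + lookup[(v >> 32) & 0xFFFF] + lookup[v >> 48];
}
```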

Is there a modern (C++11/14) way to obtain the same result?

Giorgio Gambino
  • you don't have enough memory for a 64-bit lookup table – phuclv Jul 19 '17 at 11:14
  • @Lưu Vĩnh Phúc I mean, one can compute the bit count of a 64-bit integer by dividing it into 16-bit chunks and summing up the results. This trick saves space – Giorgio Gambino Jul 19 '17 at 11:17
  • @LưuVĩnhPhúc if you use char[], then you have *exactly enough* address space for the lookup ;) – Caleth Jul 19 '17 at 11:23
  • @Caleth yes if you can find a system with 64-bit physical address – phuclv Jul 19 '17 at 11:28
  • @LưuVĩnhPhúc & Caleth you can't even find a x86-64 cpu that supports the whole 64-bit *virtual* address space as far as I know. You can only access all of your massive memory if you run without virtual memory. – eerorika Jul 19 '17 at 11:31
  • @LưuVĩnhPhúc: Read the question again. The lookup table size is 65536. A number will be processed in 16-bit chunks. No one talks about a 64-bit lookup table here. – geza Jul 19 '17 at 11:33
  • I would not discount traditional code-generation either. It is hardly necessary here but in my experience C++ meta-programming has a tendency to add a level of complexity to data and code generation tasks with limited introspection which is soon outweighed by offline code generation once things get a bit hairy. Assuming that you've got a decent build system set up. – doynax Jul 19 '17 at 11:58
  • Not an answer, but if you are using gcc or clang, I'd simply use `__builtin_popcount` unless you find out that this is a bottleneck. – chtz Jul 19 '17 at 12:00
  • Even if this *is* a bottleneck, a LUT won't be faster than the code you get for `__builtin_popcount`. Yes, I've benchmarked it. – Cody Gray Jul 19 '17 at 12:38
  • Do you really _need_ a lookup table? Or will a _fast_ routine be enough? In the latter case see the question [How to count the number of set bits in a 32-bit integer?](https://stackoverflow.com/q/109023/733637) and the [answer](https://stackoverflow.com/a/109025/733637) by [Matt Howells](https://stackoverflow.com/users/16881/matt-howells). – CiaPan Jul 19 '17 at 13:10
  • For what it's worth, x86 compilers that implement `__builtin_popcount` will emit a `popcnt` instruction if the target processor supports it, *or* they will fall back to the fast parallel bit-counting algorithm presented by Matt Howells in the answers that @CiaPan linked. So there is never really a reason to code that algorithm yourself, unless you're on a compiler that doesn't have a built-in for population count. Clearly this same optimization is applied to `std::bitset.count`, at least in the compiler Richard Hodges tested with. – Cody Gray Jul 19 '17 at 13:56
  • Have a look at [bitset2](https://github.com/ClaasBontus/bitset2) for a constexpr implementation of count. – Claas Bontus Jul 20 '17 at 07:34
  • I got this question in a Google phone interview :-) The hint was "the memory consumption of the approach doesn't matter". I didn't get the hint, and started with iteration and C++ bitsets. The lookup table was what they wanted. – Valentin Heinitz Jul 27 '17 at 10:59
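
The parallel ("SWAR") bit-counting routine referenced in the comments above can be sketched as follows (constants from the widely cited Matt Howells / Hacker's Delight approach, adapted here to 64 bits):

```cpp
#include <cstdint>

// branch-free population count: pair sums, then nibble sums, then a
// multiply that accumulates all byte counts into the top byte
int popcount64(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}
```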

4 Answers


Why not use the standard library?

#include <bitset>

int bits_in(std::uint64_t u)
{
    auto bs = std::bitset<64>(u);
    return bs.count();
}

resulting assembler (Compiled with -O2 -march=native):

bits_in(unsigned long):
        xor     eax, eax
        popcnt  rax, rdi
        ret

It is worth mentioning at this point that not all x86 processors have this instruction, so (at least with gcc) you will need to tell the compiler which architecture to target, e.g. via -march=native or -mpopcnt; with plain -O2 you get a bit-twiddling fallback instead.
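
As the comments note, GCC and Clang also expose this directly as a builtin; a minimal sketch with a portable fallback (the fallback loop is my own addition):

```cpp
#include <cstdint>

int bits_in(std::uint64_t u) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(u);   // becomes popcnt when the target supports it
#else
    int n = 0;
    for (; u; u &= u - 1)             // Kernighan's loop: one pass per set bit
        ++n;
    return n;
#endif
}
```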

@tambre mentioned that in reality, when it can, the optimiser will go further:

volatile int results[3];

int main()
{
    results[0] = bits_in(255);
    results[1] = bits_in(1023);
    results[2] = bits_in(0x8000800080008000);   
}

resulting assembler:

main:
        mov     DWORD PTR results[rip], 8
        xor     eax, eax
        mov     DWORD PTR results[rip+4], 10
        mov     DWORD PTR results[rip+8], 4
        ret

Old-school bit-twiddlers like me need to find new problems to solve :)

Update

Not everyone was happy that the solution relies on cpu help to compute the bitcount. So what if we used an autogenerated table but allowed the developer to configure the size of it? (warning - long compile time for the 16-bit table version)

#include <utility>
#include <cstdint>
#include <array>
#include <numeric>
#include <bitset>


template<std::size_t word_size, std::size_t...Is>
constexpr auto generate(std::integral_constant<std::size_t, word_size>, std::index_sequence<Is...>) {
    struct popcount_type {
        constexpr auto operator()(int i) const {
            int bits = 0;
            while (i) {
                i &= i - 1;
                ++bits;
            }
            return bits;
        }
    };
    constexpr auto popcnt = popcount_type();

    return std::array<int, sizeof...(Is)>
            {
                    {popcnt(Is)...}
            };
}

template<class T>
constexpr auto power2(T x) {
    T result = 1;
    for (T i = 0; i < x; ++i)
        result *= 2;
    return result;
}


template<class TableWord>
struct table {
    static constexpr auto word_size = std::numeric_limits<TableWord>::digits;
    static constexpr auto table_length = power2(word_size);
    using array_type = std::array<int, table_length>;
    static const array_type& get_data() {
        static const array_type data = generate(std::integral_constant<std::size_t, word_size>(),
                                           std::make_index_sequence<table_length>());
        return data;
    };

};

template<class Word>
struct use_table_word {
};

template<class Word, class TableWord = std::uint8_t>
int bits_in(Word val, use_table_word<TableWord> = use_table_word<TableWord>()) {
    constexpr auto table_word_size = std::numeric_limits<TableWord>::digits;

    constexpr auto word_size = std::numeric_limits<Word>::digits;
    constexpr auto times = word_size / table_word_size;
    static_assert(times > 0, "incompatible");

    auto reduce = [val](auto times) {
        return (val >> (table_word_size * times)) & (power2(table_word_size) - 1);
    };

    auto const& data = table<TableWord>::get_data();
    auto result = 0;
    for (int i = 0; i < times; ++i) {
        result += data[reduce(i)];
    }
    return result;
}

volatile int results[3];

#include <iostream>

int main() {
    auto input = std::uint64_t(1023);
    results[0] = bits_in(input);
    results[0] = bits_in(input, use_table_word<std::uint16_t>());

    results[1] = bits_in(0x8000800080008000);
    results[2] = bits_in(34567890);

    for (int i = 0; i < 3; ++i) {
        std::cout << results[i] << std::endl;
    }
    return 0;
}

Final Update

This version allows the use of any number of bits in the lookup table and supports any input type, even if it's smaller than the number of bits in the lookup table.

It also short-circuits if the high bits are zero.

#include <utility>
#include <cstdint>
#include <array>
#include <numeric>
#include <algorithm>

namespace detail {
    template<std::size_t bits, typename = void>
    struct smallest_word;

    template<std::size_t bits>
    struct smallest_word<bits, std::enable_if_t<(bits <= 8)>>
    {
        using type = std::uint8_t;
    };

    template<std::size_t bits>
    struct smallest_word<bits, std::enable_if_t<(bits > 8 and bits <= 16)>>
    {
        using type = std::uint16_t;
    };

    template<std::size_t bits>
    struct smallest_word<bits, std::enable_if_t<(bits > 16 and bits <= 32)>>
    {
        using type = std::uint32_t;
    };

    template<std::size_t bits>
    struct smallest_word<bits, std::enable_if_t<(bits > 32 and bits <= 64)>>
    {
        using type = std::uint64_t;
    };
}

template<std::size_t bits> using smallest_word = typename detail::smallest_word<bits>::type;

template<class WordType, std::size_t bits, std::size_t...Is>
constexpr auto generate(std::index_sequence<Is...>) {

    using word_type = WordType;

    struct popcount_type {
        constexpr auto operator()(word_type i) const {
            int result = 0;
            while (i) {
                i &= i - 1;
                ++result;
            }
            return result;
        }
    };
    constexpr auto popcnt = popcount_type();

    return std::array<word_type, sizeof...(Is)>
            {
                    {popcnt(Is)...}
            };
}

template<class T>
constexpr auto power2(T x) {
    return T(1) << x;
}

template<std::size_t word_size>
struct table {

    static constexpr auto table_length = power2(word_size);

    using word_type = smallest_word<word_size>;

    using array_type = std::array<word_type, table_length>;

    static const array_type& get_data() {
        static const array_type data = generate<word_type, word_size>(std::make_index_sequence<table_length>());
        return data;
    };

    template<class Type, std::size_t bits>
    static constexpr auto n_bits() {
        auto result = Type();
        auto b = bits;
        while(b--) {
            result = (result << 1) | Type(1);
        }
        return result;
    };

    template<class Uint>
    int operator()(Uint i) const {
        constexpr auto mask = n_bits<Uint, word_size>();
        return get_data()[i & mask];
    }

};

template<int bits>
struct use_bits {
    static constexpr auto digits = bits;
};

template<class T>
constexpr auto minimum(T x, T y) { return x < y ? x : y; }

template<class Word, class UseBits = use_bits<8>>
int bits_in(Word val, UseBits = UseBits()) {

    using word_type = std::make_unsigned_t<Word>;
    auto uval = static_cast<word_type>(val);


    constexpr auto table_word_size = UseBits::digits;
    constexpr auto word_size = std::numeric_limits<word_type>::digits;

    auto const& mytable = table<table_word_size>();
    int result = 0;
    while (uval)
    {
        result += mytable(uval);
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wshift-count-overflow"
                uval >>= minimum(table_word_size, word_size);
#pragma clang diagnostic pop
    }

    return result;
}

volatile int results[4];

#include <iostream>

int main() {
    auto input = std::uint8_t(127);
    results[0] = bits_in(input);
    results[1] = bits_in(input, use_bits<4>());
    results[2] = bits_in(input, use_bits<11>());
    results[3] = bits_in(input, use_bits<15>());

    for (auto&& i : results) {
        std::cout << i << std::endl;
    }

    auto input2 = 0xabcdef;
    results[0] = bits_in(input2);
    results[1] = bits_in(input2, use_bits<4>());
    results[2] = bits_in(input2, use_bits<11>());
    results[3] = bits_in(input2, use_bits<15>());

    for (auto&& i : results) {
        std::cout << i << std::endl;
    }

    auto input3 = -1;
    results[0] = bits_in(input3);
    results[1] = bits_in(input3, use_bits<4>());
    results[2] = bits_in(input3, use_bits<11>());
    results[3] = bits_in(input3, use_bits<15>());

    for (auto&& i : results) {
        std::cout << i << std::endl;
    }

    return 0;
}

example output:

7
7
7
7
17
17
17
17
32
32
32
32

The resulting assembly output for a call to bits_in(int, use_bits<11>()) for example, becomes:

.L16:
        mov     edx, edi
        and     edx, 2047
        movzx   edx, WORD PTR table<11ul>::get_data()::data[rdx+rdx]
        add     eax, edx
        shr     edi, 11
        jne     .L16

Which seems reasonable to me.

Richard Hodges
  • I think the resulting assembly example is kinda dishonest, because you don't actually use it in the example ~~, so of course it'd be optimized to nothing~~. – tambre Jul 19 '17 at 12:04
  • It is much better as it saves lots of CPU cycles and L2 cache – Dennis C Jul 19 '17 at 12:06
  • @tambre: Can you elaborate on in what sense the code gets optimized out? The emitted code sure looks like a normal population count function to me. – doynax Jul 19 '17 at 12:06
  • @Richard Hodges: The compiler is capable of optimizing out the constant case. That is nice but the coverage of the general case above insures an effective worst-case, at least on this particular CPU. I certainly cannot see how the original general case could be construed as _dishonest_. – doynax Jul 19 '17 at 12:14
  • @doynax i think he changed his comment. – Richard Hodges Jul 19 '17 at 12:33
  • It's stunning that the optimizer is able to figure out `bits_in` is nothing but `return __builtin_popcountll(u)` and, not only that, can even compute it at compile time. That's why intrinsics are so much better than inline asm, when possible. NB: `bitset::count` returns `size_t`. – edmz Jul 19 '17 at 12:47
  • This was the question: "Suppose that I want to create a compile time constructed bit count lookup table for 64bit integers in 16 bit chunks". This is not an answer to this question. You can mention this solution as an alternative, but it is not an answer. Too bad that this answer is the most upvoted one. – geza Jul 19 '17 at 13:34
  • Note that while GCC and Clang will make this optimization (emitting `popcnt` if available, or falling back to the fast parallel bit-counting algorithm mentioned in the comments on the question), MSVC does *not* know about this (as of VS 2015), and there's no optimizer switch that can instruct it to do so (e.g., `/arch:AVX2` isn't taken to imply `popcnt` support). If you're using MSVC, and you don't want slow, one-byte-at-a-time counting code, you will need to do something different. – Cody Gray Jul 19 '17 at 14:01
  • @geza: StackOverflow favors solving the problem over answering the question as asked, notably because many questions suffer from X/Y problems. It's more likely that the OP is trying to find a fast way to count the bits, rather than being deadset on using a 16-bit table method (and why 16 bits and not 8?). If the OP were to clarify they absolutely want to use a table, even if it's slower, then it would be different... and a rather surprising question. – Matthieu M. Jul 19 '17 at 15:27
  • @MatthieuM.: the problem is written in the question. The question wasn't "how can I count bits fast on x86?" or "is there a stdlib function for counting bits?". What if the compiler doesn't optimize this (as Cody says, this is the case for MSVC)? What if he is not on x86? I kind of dislike answers which don't answer the question, but a hypothetical problem, which is connected to the question (look at the other upvoted answer, it is actually an answer). I see that this is a trend here on SO, and I don't think that it is a good thing. But this can be discussed further in meta, I think. – geza Jul 19 '17 at 15:40
  • @MatthieuM.: and I think a table method may be viable on certain machines. A smaller table would be better (so it fits in L1). Actually, several years ago, I've faced a similar problem for 32-bit numbers, and a 11-bit table was the fastest method available. – geza Jul 19 '17 at 15:53
  • @geza I have taken your comments to heart and provided a compile-time table solution with code-configurable table size. – Richard Hodges Jul 19 '17 at 16:47
  • @RichardHodges: The use of `int` as element of the array is clearly suboptimal. The count cannot easily exceed 256, so might as well use a byte (at least), rather than 4 bytes per entry! – Matthieu M. Jul 19 '17 at 16:48
  • @MatthieuM. completely agree. In fact, I think it would be better to template-parmeterise on the bit-length of the table entries and compute appropriate types from there.... but i've got work to do :) – Richard Hodges Jul 19 '17 at 16:50
  • In the table answer, why define a power-of-two function, rather than just using leftshifts? – Jacob Manaker Jul 19 '17 at 16:56
  • @JacobManaker i used a function to make the code more expressive. You're right though, a better implementation would use leftshifts. – Richard Hodges Jul 19 '17 at 17:01
  • @RichardHodges: Thanks for that! Just to be clear, I didn't wanted to debunk your answer, it was useful already, but it is more useful now, I think. Just a nitpicking comment from me: maybe you should mention that popcnt instruction is not available on every x86 (I've checked, and it is available from Nehalem). GCC for example, uses a method similar to Akira's answer, if popcnt is not available (and this method was actually slower for 32-bit numbers, than using a 11-bit table with 3 reads, if the table is in the cache - I've checked this several years ago, cannot remember the exact CPU). – geza Jul 19 '17 at 17:13
  • @geza no offence taken :) what started out as a simple question actually turns out to be a fascinating one. You know how it is - I won't be able to sleep now until I've written one based on variable bit-lengths with a type trait to compute the most efficient word size... – Richard Hodges Jul 19 '17 at 17:19
  • @geza the question clearly asks “Is there a modern (C++11/14) way to obtain the same result?” and that has been answered here. Even if the target CPU doesn’t have a `popcnt` instruction, it is reasonable to assume that a “modern” compiler will optimize the `std::bitset` approach to something that is at least on par with the lookup table approach. Most notably, because the compiler already knows which of [these alternatives](http://www.hackersdelight.org/hdcodetxt/pop.c.txt) is the best for the particular target platform… – Holger Jul 19 '17 at 17:51
  • @Holger: Yes, it is reasonable to assume that, but unfortunately, it is not the case. For example, on my system, GCC generates a library call (which is in a .so, so the call itself is slow) for __builtin_popcount. For 32-bit, I'm absolutely sure that it is slower than the table based method. For 64-bit, I think it is slower too, but I'd have to measure it. Compilers usually don't use tables for this kind of things (of course this depends on the exact conditions. If only several popcounts are needed, and the table is not in cache, then GCC's solution may win) – geza Jul 19 '17 at 18:02
  • @Holger: I've just checked, table based method for 32-bit numbers is 2.68x faster than the GCC'd method (I've used a simple benchmark: summed popcount of numbers 0->2000000000). And table based method is 2.23x slower than popcount instruction. – geza Jul 19 '17 at 18:17
  • @geza: usually the libgcc helper-functions gcc calls for popcount, extended-precision division, and so on, are statically linked (from `libgcc.a`). But yes, especially for 64-bit inputs on machines with a fast multiply, the bit-hack approach is probably better than a byte-LUT (or an 11-bit LUT). I was disappointed to see that's what gcc used for `__builtin_popcnt` without `-mpopcnt` (or `-msse4.2`). clang can even auto-vectorize popcount with AVX2 `vpshufb`, which is cool: https://godbolt.org/g/pYuDHZ. Related: bottom of this answer https://stackoverflow.com/a/45392171/224132 for more SIMD – Peter Cordes Dec 15 '17 at 03:54
  • @RichardHodges: Your templated LUT only ever needs `uint8_t` entries. `popcount(uint64_t)` is in the range `[0..64]`, and that would require a table with `2^64` entries... BTW, branch misses are expensive, so looping over *all* the bits unconditionally might be cheaper unless the table is small or the integers almost always have only their low few bits set. It's probably good for the compiler to fully-unroll a 3-iteration loop. Especially with BMI2 `rorx` to copy+rotate and save a `mov` instruction. – Peter Cordes Dec 15 '17 at 04:09
  • At the top of your answer, you didn't *just* compile with `-O2` to get `popcnt`. You must have also used `-march=native` or `-march=nehalem`, or `-msse4.2`. Much code is unfortunately still compiled for baseline x86-64. You should definitely clarify that you don't get `popcnt` "by default". – Peter Cordes Dec 15 '17 at 04:11
  • Are you planning to update the LUT? `movzx edx, WORD PTR table<11ul>::get_data()::data[rdx+rdx]` only needs to be `byte ptr`. It's kind of a shame to throw out all that clever type-deducing code, but it's really not needed for this problem. (And if you could get compilers to emit it, `add al, [data[rdx+rdx]` instead of movzx/add would save an instruction without introducing [partial-register penalties](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers) even on Core2/Nehalem, as long as `eax` is xor-zeroed outside the loop.) – Peter Cordes Dec 15 '17 at 08:19
  • @PeterCordes :) Fair point, the bit-count is never going to overflow one byte in any architecture available to me. I could happily spend all day micro-optimising the code but I have a product to demo today - emscripten-transpiled c++ running in an angular 5 web application. More than enough work for one man. I'll revisit this later. – Richard Hodges Dec 15 '17 at 08:53
  • Strangely, in my tests with g++ `int bits_in()` performed slightly worse than `unsigned int bits_in()`. So you might want to try change the function's return type if performance is critical. – Andriy Makukha Jan 22 '19 at 07:41

This is a C++14 solution, built around the use of constexpr:

#include <cstddef>
#include <iostream>

// this struct is a primitive replacement of the std::array that
// has no 'constexpr reference operator[]' in C++14
template<int N>
struct lookup_table {
    int table[N];

    constexpr int& operator[](size_t i) { return table[i]; }
    constexpr const int& operator[](size_t i) const { return table[i]; }
};

constexpr int bit_count(int i) { 
    int bits = 0; 
    while (i) { i &= i-1; ++bits; } 
    return bits;
}

template<int N> 
constexpr lookup_table<N> generate() {
    lookup_table<N> table = {};

    for (int i = 0; i < N; ++i)
        table[i] = bit_count(i);

    return table;
}

template<int I> struct Check {
    Check() { std::cout <<  I << "\n"; }
};

constexpr auto table = generate<65536>();

int main() {
    // checks that they are evaluated at compile-time 
    Check<table[5]>();
    Check<table[65535]>();
    return 0;
}

Runnable version: http://ideone.com/zQB86O

DAle
  • Is there any particular reason as to why, in the `const` `operator[]` overload, the primitive (`constexpr`) return type is by reference rather than by value? I believe overloading the array subscript operator generally recommends value return for the `const` variant in case the return is a primitive(/built-in) type, but I'm not well-versed with `constexpr` in contexts such as this one. Nice answer! – dfrib Jul 19 '17 at 18:04
  • @dfri, thanks! No, there was no particular reason, it was a 'copy' of the `std::array` generic operator and I believe could be changed to a value return. – DAle Jul 19 '17 at 18:42

With C++14 you can use constexpr to construct the lookup table at compile time. Using a population count calculation, the lookup table can be constructed as follows:

#include <array>
#include <cstdint>

template<std::size_t N>
constexpr std::array<std::uint16_t, N> make_lookup() {
    std::array<std::uint16_t, N> table {};

    for(std::size_t i = 0; i < N; ++i) {
        std::uint16_t popcnt = i;

        popcnt = popcnt - ((popcnt >> 1) & 0x5555);
        popcnt = (popcnt & 0x3333) + ((popcnt >> 2) & 0x3333);
        popcnt = ((popcnt + (popcnt >> 4)) & 0x0F0F) * 0x0101;

        table[i] = popcnt >> 8;
    }
    return table;
}

Sample usage:

auto lookup = make_lookup<65536>();

The std::array::operator[] is constexpr only since C++17; with C++14 the example above compiles but won't be a true constexpr.


If you like to punish your compiler, you can initialize the resulting std::array with variadic templates too. This version will work with C++14 too and even with C++11 by using the indices trick.

#include <array>
#include <cstdint>
#include <utility>

namespace detail {
constexpr std::uint8_t popcnt_8(std::uint8_t i) {
    i = i - ((i >> 1) & 0x55);
    i = (i & 0x33) + ((i >> 2) & 0x33);
    return ((i + (i >> 4)) & 0x0F);
}

template<std::size_t... I>
constexpr std::array<std::uint8_t, sizeof...(I)>
make_lookup_impl(std::index_sequence<I...>) {
    return { popcnt_8(I)... };
}
} /* detail */

template<std::size_t N>
constexpr decltype(auto) make_lookup() {
    return detail::make_lookup_impl(std::make_index_sequence<N>{});
}

Note: In the example above I switched to the 8-bit integers from 16-bit integers.

Assembly Output

The 8-bit version will make only 256 template arguments for detail::make_lookup_impl function instead of 65536. The latter is too much and will exceed the template instantiation depth maximum. If you have more than enough virtual memory, you can increase this maximum with -ftemplate-depth=65536 compiler flag on GCC and switch back to 16-bit integers.

Anyway, take a look at the following demo and see how the 8-bit version counts the set bits of a 64-bit integer.

Live Demo
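
As a rough usage sketch (my own, mirroring what the linked demo does), the 256-entry table from the variadic version can count the bits of a 64-bit integer eight bits at a time:

```cpp
#include <array>
#include <cstdint>
#include <utility>

namespace detail {
constexpr std::uint8_t popcnt_8(std::uint8_t i) {
    i = i - ((i >> 1) & 0x55);
    i = (i & 0x33) + ((i >> 2) & 0x33);
    return (i + (i >> 4)) & 0x0F;
}

template<std::size_t... I>
constexpr std::array<std::uint8_t, sizeof...(I)>
make_lookup_impl(std::index_sequence<I...>) {
    return { popcnt_8(I)... };
}
} /* detail */

constexpr auto lookup = detail::make_lookup_impl(std::make_index_sequence<256>{});

// count the set bits of a 64-bit value, one 8-bit chunk at a time
int bits_in(std::uint64_t v) {
    int n = 0;
    for (int i = 0; i < 8; ++i)
        n += lookup[(v >> (8 * i)) & 0xFF];
    return n;
}
```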

Akira
  • In C++14 `std::array::operator[]` is not `constexpr`, and it seems this code will be evaluated at compile time only in C++17. That's why I did not use `std::array` in my example. – DAle Jul 19 '17 at 13:23
  • @DAle, yes, you are right. I edited my answer accordingly. – Akira Jul 19 '17 at 14:04
  • You can get this to work in c++14 by making table a C array, implementing c++17's `std::to_array`, and returning `to_array(table)`. – Erroneous Jul 19 '17 at 19:04
  • @Erroneous, it's a good idea but unfortunately in this case it will produce a lot of template arguments (namely 65536) and it will exceed the template instantiation depth maximum. This maximum can be increased with `-ftemplate-depth=65536` compiler flag but it has a serious negative impact on compilation time. – Akira Jul 19 '17 at 20:23
  • @Akira I didn't get any issues on gcc 7.1.1. I used the implementation from http://en.cppreference.com/w/cpp/experimental/to_array and compiled with `-std=c++14 -ftemplate-depth=256`. – Erroneous Jul 19 '17 at 20:34
  • @Erroneous, now try the same with `-ftemplate-depth=65536` and with an array of about 60000 elements. :) Anyway I added an other approach to my answer based on your suggestion. – Akira Jul 19 '17 at 22:36
  • all pretty cool, though as @DAle mentioned, if you just return a plain array then you don't even need the extra templates. I need at least a depth of 26 for this to compile on my system. – Erroneous Jul 20 '17 at 13:15

One more for posterity, creating a lookup table using a recursive solution (of log(N) depth). It makes use of constexpr-if and constexpr-array-operator[], so it's very much C++17.

#include <array>

template<size_t Target, size_t I = 1>
constexpr auto make_table (std::array<int, I> in = {{ 0 }})
{
  if constexpr (I >= Target)
  {
    return in;
  }
  else
  {
    std::array<int, I * 2> out {{}};
    for (size_t i = 0; i != I; ++i)
    {
      out[i] = in[i];
      out[I + i] = in[i] + 1;
    }
    return make_table<Target> (out);
  }
}

constexpr auto population = make_table<65536> ();

See it compile here: https://godbolt.org/g/RJG1JA
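
A few compile-time spot checks (my own static_asserts, not part of the answer) confirm the table entries; the snippet below is self-contained, repeating the answer's `make_table`:

```cpp
#include <array>
#include <cstddef>

template<std::size_t Target, std::size_t I = 1>
constexpr auto make_table (std::array<int, I> in = {{ 0 }})
{
  if constexpr (I >= Target) {
    return in;
  } else {
    std::array<int, I * 2> out {{}};
    for (std::size_t i = 0; i != I; ++i) {
      out[i] = in[i];           // low half repeats the counts so far
      out[I + i] = in[i] + 1;   // high half has one extra bit set
    }
    return make_table<Target> (out);
  }
}

constexpr auto population = make_table<65536> ();

static_assert(population[0] == 0, "");
static_assert(population[0b1011] == 3, "");   // three bits set
static_assert(population[65535] == 16, "");   // all sixteen bits set
```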

Matt A