18

What's the best way to perform bitwise operations on vector<bool>?

As I understand it, vector<bool> is a specialisation that uses one bit per boolean. I chose vector<bool> for memory-saving reasons. I know that there are some problems with vector<bool>, but for my needs it's appropriate.

Now, what's the most performant way of applying bitwise operations to whole vectors like this?

If I do it in a for loop, reading out each single bool and storing it back, then as I understand it a lot of extra operations are performed internally just to access the actual values.

thanks!

Mat
  • 2,473
  • 5
  • 23
  • 24

4 Answers

11

If the number of bits is fixed at compile time, you would be much better off using std::bitset.

If not (i.e., the number of bits varies at runtime), then you can use boost::dynamic_bitset.

In both of these, it is extremely easy to do all the bitwise operations.
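
For illustration, here's a minimal sketch of the bitwise operations with both types (the sizes and bit positions below are made up):

#include <bitset>
#include <boost/dynamic_bitset.hpp>

int main()
{
    // Size fixed at compile time: std::bitset
    std::bitset<128> a, b;
    a.set(3);
    b.set(3);
    b.set(70);
    std::bitset<128> c = a & b;               // also supports |, ^, ~, <<, >>

    // Size only known at runtime: boost::dynamic_bitset
    boost::dynamic_bitset<> x(1000), y(1000);
    x[42] = true;
    y[707] = true;
    boost::dynamic_bitset<> z = x | y;        // operands must have the same size
}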

Arun
  • 18,092
  • 8
  • 46
  • 58
  • +1 because I'd use these libraries if I decided I could take the dependencies on for the project I was working on. – Merlyn Morgan-Graham Oct 29 '10 at 03:46
  • The provided link to the bitset documentation changes on occasion and is no longer valid. Here is a link to the current latest version, which will also become invalid eventually: http://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.5/a00396.html – Jimbo Feb 22 '11 at 20:09
  • Bitwise ops on either of those do auto-vectorize with gcc and clang, doing 16 or 32 bytes at a time. They're very good. https://godbolt.org/g/cnRFEG (But `boost::dynamic_bitset` has a fairly poor `.count()`, using a byte-LUT even when hardware `popcnt` and AVX2 instructions are available. `std::bitset` has good `.count()` performance, too.) – Peter Cordes Jul 31 '17 at 05:14
4

Ignoring the title of your question, let's answer this question instead:

what's the best way to perform bitwise operations on vector?

The best way is to define your vector as vector<unsigned char> (or vector<uint32_t>, or whichever other integer type you choose), and do your bitwise operations how you normally would for an array of unsigned integers. Things will be much faster this way, and there will be no hidden mechanism.

You can use division (or bitwise operators, if you're slick) to resolve which array index you need to operate against, and for-loops to apply bitwise operations larger than a single element.

Here's a related question: Bit twiddling a lot of bits in C

You will basically be doing these same operations, if and when you decide to wrap vector<unsigned some-int-type> with your own operators.
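
As a rough sketch of that wrapping approach (the class name and member functions here are invented, not anything standard), it might look like:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: packs bits into 64-bit words, exposing per-bit
// access plus word-at-a-time bitwise operators.
struct BitVector
{
    std::vector<std::uint64_t> words;

    explicit BitVector(std::size_t nbits) : words((nbits + 63) / 64, 0) {}

    bool get(std::size_t i) const
    {
        // division picks the word, remainder picks the bit within it
        return (words[i / 64] >> (i % 64)) & 1u;
    }

    void set(std::size_t i, bool v)
    {
        std::uint64_t mask = std::uint64_t(1) << (i % 64);
        if (v) words[i / 64] |= mask;
        else   words[i / 64] &= ~mask;
    }

    // assumes both operands hold the same number of bits
    BitVector& operator|=(const BitVector& rhs)
    {
        for (std::size_t w = 0; w < words.size(); ++w)   // 64 bits per iteration
            words[w] |= rhs.words[w];
        return *this;
    }
};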

Merlyn Morgan-Graham
  • 54,918
  • 14
  • 119
  • 174
  • So if I still want the possibility to access my boolean values directly, I actually have to encapsulate the vector and do the mechanics vector<bool> does myself, just additionally providing functionality for bitwise operations? – Mat Oct 29 '10 at 03:32
  • @Mat: You don't *have* to do anything, but to get the best performance, I do suggest doing what you just described. Wrap the whole thing in a class, and implement overloaded operators against it. Then you can optimize to your heart's content. It really isn't very complicated code to implement, and gives you the right balance of abstraction and performance. – Merlyn Morgan-Graham Oct 29 '10 at 03:36
  • Actually, I just wanted to know whether this functionality maybe already exists somewhere, because it seems logical to me that this is something you'd want to do with bit vectors. – Mat Oct 29 '10 at 03:39
  • @Mat: I think I've read that people regard the specialization on `vector<bool>` as having been a bad idea, so the standards committee might have decided not to throw good money after bad. I wouldn't be too surprised if there was a third-party library, but there is nothing standard. – Merlyn Morgan-Graham Oct 29 '10 at 03:43
  • @Mat: If you can use a third party library, go for ArunSaha's answer :) – Merlyn Morgan-Graham Oct 29 '10 at 03:46
3

I read both these answers, but just wanted a quick solution, and implemented something horrible.

You can make the bitwise operators work on vector<bool>, but the code has to be specialized for the C++ standard library implementation, or fall back to the slow form. Here's my operator| for GNU libstdc++-v3:

#include <stdexcept>
#include <vector>

std::vector<bool> operator|(std::vector<bool> A, const std::vector<bool>& B)
{
    if (A.size() != B.size())
        throw std::invalid_argument("differently sized bitwise operands");

    std::vector<bool>::iterator itA = A.begin();
    std::vector<bool>::const_iterator itB = B.begin();

    // C++ implementation-specific: _M_p is the iterator's underlying word pointer in libstdc++
    while (itA < A.end())
        *(itA._M_p ++) |= *(itB._M_p ++); // word-at-a-time bitwise operation

    return A;
}

This is of course pretty bad. Somebody updates GCC, the new version stores things differently, and your code breaks for no apparent reason.
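
For completeness, a quick usage sketch of the operator above (the sizes and indices are invented):

std::vector<bool> a(1000), b(1000);
a[10] = true;
b[500] = true;
std::vector<bool> c = a | b;   // ORs 1000 bits a word at a time
// c[10] and c[500] are now true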

fuzzyTew
  • 2,860
  • 22
  • 23
  • This unfortunately doesn't auto-vectorize with gcc or clang for x86-64, with `-O3 -march=haswell`. https://godbolt.org/g/eJnfKp. It should be using AVX2 to `vpor` 32 bytes at a time, instead of only 8 at a time with scalar `or`. Boost's `dynamic_bitset<>` does auto-vectorize for `a & b`. But unfortunately boost's `.count()` function isn't very efficient. (But much better than `std::count` on `vector<bool>`, unless you use `clang -stdlib=libc++`, in which case that vectorizes with AVX2.) – Peter Cordes Jul 31 '17 at 04:44
  • Anyway, I haven't found a way to get good code for a popcount of the result of a boolean op between two dynamically-sized bit-vectors, other than manually vectorizing it of course. – Peter Cordes Jul 31 '17 at 04:46
  • @PeterCordes - `clang` can be [coaxed](https://godbolt.org/g/UJvAez) into producing some pretty good code for bit-ops (`and` in this example) using a temporary array on the stack. Storing to the stack and then reloading isn't actually a terrible way to get 4 values from a `ymm` into a place it can be used by `popcnt`. The `mov r9, rcx; or r9, 8` ops interspersed with the vectorized and are a bit silly and may cost a little bit of throughput (why didn't they just use an offset in the addressing modes, or at least do it with a single `add`?). – BeeOnRope Jul 31 '17 at 21:15
  • They don't unroll the `popcnt` enough to avoid the latency due to the false dep (and they add an extra cycle due to the way they accumulate it), but it probably gets to better than 50% of the performance of a hand-tuned one, which is pretty good. For the hand-tuned one, it isn't clear to me if a `pshufb` nibble-counting approach is better or a `popcnt` one, or perhaps a combination of the two. Probably the former, since the output is already in a `ymm`. – BeeOnRope Jul 31 '17 at 21:25
  • @BeeOnRope: On Haswell/Skylake, AVX2 unpacking to nibbles then `vpshufb` -> `vpaddb` is faster than scalar `popcnt` for large arrays. (Every 255 or fewer iterations, use `vpsadbw` to hsum to 64-bit). Especially [when you want `popcnt(A[] & B[])`](https://stackoverflow.com/questions/45376006/large-0-1-matrix-multiplication-using-bitwise-and-and-popcount-instead-of-actu), AVX2 is a big win. But getting auto-vectorized C++ to compile to something that does `vpand`->shift/mask->2x`vpshufb`->`vpaddb` without a store/reload between the AND and popcount may not be possible. – Peter Cordes Jul 31 '17 at 21:44
  • @BeeOnRope: [This blog post](http://wm.ite.pl/articles/sse-popcount.html) has HSW and SKL numbers for AVX2 vpshufb vs. `popcnt` on various array sizes, for just the popcount problem, not the AND of two bitmaps on the fly. Using that guy's hand-written asm, AVX2 is 25% faster for large buffers. – Peter Cordes Jul 31 '17 at 21:51
  • For what it's worth, `clang` does change `__builtin_popcount` to an AVX version in some cases, but not in the code I linked above. It's not totally surprising that the AVX2 popcnt is faster, since it has a max throughput of 128 bits per cycle, limited by the single `pshufb` (32 nibbles), versus 64 bits per cycle for `popcnt`, limited by it only issuing on one port. Of course, on Ryzen this is reversed and `popcnt` can issue 4-wide for 256 bits per cycle. So the `clang` code probably isn't too terrible there (and Ryzen may not have the false dep). @PeterCordes – BeeOnRope Aug 01 '17 at 00:35
  • 1
    @BeeOnRope: Ryzen can still only do 2 loads per cycle, so the trick is keeping it fed (especially if you want to popcount the AND of two bitmaps). Maybe with a mix of scalar loads and `movdqa`->`movq`+`pextrq`? Hmm, `pextrq` is 2 uops. Maybe just a hybrid strategy of `popcnt` + AVX2, interleaved in the right ratio to max out ALU throughput with AVX2, and max out load throughput with `popcnt rax, [mem]`+`add rdx, rax`. – Peter Cordes Aug 01 '17 at 00:47
  • @PeterCordes - right, good point. Well we can say that at least `popcnt` can easily process 128-bits/cycle from memory at the cost of 2 ops, so it's probably always better than a pure AVX2 approach (except when the inputs are already in AVX2 regs), but you might be able to squeeze out a bit more throughput with a hybrid approach. – BeeOnRope Aug 01 '17 at 02:08
0

This one should work too.

std::vector<bool> v3(v1.size());
std::transform(v1.begin(), v1.end(), 
               v2.begin(), v3.begin(), std::logical_and<bool>());
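
For context, here's a self-contained version of that snippet (the contents of v1 and v2 are invented):

#include <algorithm>
#include <functional>
#include <vector>

int main()
{
    std::vector<bool> v1{true, false, true, true};
    std::vector<bool> v2{true, true,  false, true};

    std::vector<bool> v3(v1.size());
    std::transform(v1.begin(), v1.end(),
                   v2.begin(), v3.begin(), std::logical_and<bool>());
    // v3 is now {true, false, false, true}
}
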
SingerOfTheFall
  • 27,171
  • 7
  • 60
  • 100
  • 1
    This will be incredibly inefficient because each bit needs to be extracted from its word, then transformed, then inserted back into the word. The solutions that operate on a word at a time will be much more efficient. – Joel Mar 03 '15 at 17:08
  • @Joel: Standard libraries [can and should specialize their `std` algorithm templates for `vector<bool>`](https://isocpp.org/blog/2012/11/on-vectorbool). libc++ does for some things (e.g. `std::count` on `vector<bool>` auto-vectorizes to very good AVX2 code). But in this case, clang and gcc with libc++ and libstdc++ both make *terrible* code, looping a bit at a time and not even doing the AND efficiently. (Testing the bit separately in each input, and using conditional branches! https://godbolt.org/g/4MDd8v) (Even if you use `std::bit_and` instead of `logical_and`.) – Peter Cordes Jul 31 '17 at 05:05
  • 1
    Anyway, this answer isn't technically bad, it just exposes MASSIVE missed-optimizations by compilers and standard library writers. Which means in practice you shouldn't use it yet. – Peter Cordes Jul 31 '17 at 05:06