C++11 vector<bool> performance issue (with code example)

Question

I noticed that vector<bool> is much slower than a bool array when running the following code.

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    int count = 0;
    int n = 1500000;
    // slower with c++ vector<bool>
    /*vector<bool> isPrime;
    isPrime.reserve(n);
    isPrime.assign(n, true);
    */
    // faster with bool array
    bool* isPrime = new bool[n];

    for (int i = 0; i < n; ++i)
        isPrime[i] = true;

    for (int i = 2; i < n; ++i) {
        if (isPrime[i])
            count++;
        for (int j = 2; i*j < n; ++j)
            isPrime[i*j] = false;
    }

    cout << count << endl;
    delete[] isPrime;  // release the array (not needed for the vector variant)
    return 0;
}

Is there some way to make vector<bool> faster? By the way, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.

you are accessing `isPrime` beyond its end, it should be `new bool[n]` — Karsten Koop, Apr 29 '16 at 07:55
Don't use `vector<bool>` if you're super-concerned about performance. It's required by the standard to be very space efficient, and that has a performance cost. — David Schwartz, Apr 29 '16 at 07:59
How much of a slowdown are you talking about? You might want to add some timing examples to make this question more appealing. — anderas, Apr 29 '16 at 07:59

manlio · Answer 1 · 2016-05-03T17:26:54.033

16

std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).

In general you can:

- use std::vector<std::uint8_t> instead of std::vector<bool> (give std::valarray<bool> a try as well).
  This requires more memory and is less cache-friendly, but there isn't an overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all, it's just like your array of bool but without the nuisance of memory management)
- use std::bitset if you know at compile time how large your boolean array is going to be (or if you can at least establish a reasonable upper bound)
- if Boost is an option, try boost::dynamic_bitset (the size can be specified at runtime)
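As a rough illustration of the std::bitset option, here is a minimal sketch of the sieve with the size fixed at compile time. The helper name count_primes_bitset and the heap allocation are assumptions made for this example, not something taken from the measurements in this answer.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>

// Hypothetical helper (not from the answer): counts the primes below N
// using a std::bitset whose size is fixed at compile time.
template <std::size_t N>
std::size_t count_primes_bitset()
{
    static_assert(N >= 2, "the sieve needs room for indices 0 and 1");

    // Allocate on the heap so a large N does not consume stack space.
    std::unique_ptr<std::bitset<N>> is_prime(new std::bitset<N>());
    is_prime->set();      // mark every index as a candidate
    is_prime->reset(0);   // 0 and 1 are not prime
    is_prime->reset(1);

    for (std::size_t i = 2; i * i < N; ++i) {
        if (!is_prime->test(i))
            continue;
        for (std::size_t j = i * i; j < N; j += i)
            is_prime->reset(j);   // cross off multiples of i
    }
    return is_prime->count();     // number of bits still set
}
```

With the n from the question this would be called as count_primes_bitset<1500000>(). Whether it beats the other containers is something to measure on your own compiler and hardware.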

But for speed optimizations you have to test...
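One possible way to run such a test is sketched below: the sieve is templated on the flag container so the same loop can be timed with different types. The names count_primes and time_ms are hypothetical, and this is not the source code used for the numbers in this answer.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sieve templated on the flag container, so the same loop can be timed
// with std::vector<bool>, std::vector<char>, std::vector<std::uint8_t>, ...
template <class Flags>
std::size_t count_primes(std::size_t n)
{
    if (n < 2)
        return 0;
    Flags is_prime(n, true);
    std::size_t count = 0;
    for (std::size_t i = 2; i < n; ++i) {
        if (!is_prime[i])
            continue;
        ++count;
        // Assumes i * i does not overflow std::size_t for the chosen n.
        for (std::size_t j = i * i; j < n; j += i)
            is_prime[j] = false;
    }
    return count;
}

// Calls f `runs` times and returns the total elapsed time in milliseconds.
template <class F>
double time_ms(F f, int runs)
{
    using clock = std::chrono::steady_clock;
    // Storing each result in a volatile discourages the optimizer
    // from discarding the calls.
    volatile std::size_t sink = 0;
    const auto t0 = clock::now();
    for (int r = 0; r < runs; ++r)
        sink = f();
    const auto t1 = clock::now();
    (void)sink;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A call such as time_ms([] { return count_primes<std::vector<bool>>(1500000); }, 100) can then be compared against the same call with std::vector<char>, built with the optimization flags you actually ship with.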

With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).

Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:

                    time (ms)
                 G++      CLANG++
array of bool    3103     3010
vector<bool>     2835     2420    // not bad!
vector<char>     3136     3031    // same as array of bool
bitset           2742     2388    // marginally better

(time elapsed for 100 runs of the code in the answer)

std::vector<bool> doesn't look so bad (source code here).

edited May 03 '16 at 17:26

answered Apr 29 '16 at 08:29

manlio


1

"Xeon" can be anything from P4 to Skylake. Saying *which* Xeon (e.g. Haswell Xeon, or Exxxx v3) would be much more informative. Those are pretty old compiler versions for modern hardware, too (not as big a deal if you aren't auto-vectorizing or using `-march=native`). – Peter Cordes Apr 30 '16 at 18:54
1

@PeterCordes you're right. It's a Xeon E3-1230 v3 (using the `-march=native` switch). In my defence I would add that it wasn't meant to be a complete examination, but I'll add test code and some more results as soon as I can. – manlio Apr 30 '16 at 20:50
You don't have to make a big deal out of timing this. Just any time you post any perf numbers, the specific microarch and compiler version + options are essential. e.g. clang++ 3.4.5 on a Haswell Xeon, `-O3 -march=native` would cover it. BTW, clang 3.7.1 (current stable) auto-vectorizes for AVX2 significantly better than 3.4, in terms of the quality of the asm. IDK how often that makes a perf diff, since memory bottlenecks often hide CPU bottlenecks. http://llvm.org/apt/ has current versions of clang for debian-based Linux distros. – Peter Cordes Apr 30 '16 at 22:29
I confirm that the performance benefit of `std::vector<bool>` is not only observed on x86 processors, but also on my Raspberry Pi with gcc 8.3, with just a simple `-O2` switch. It's something like a 0.5s vs. 2.5s difference, with `std::vector<bool>` being 0.5s. – Samuel Li Jan 01 '20 at 21:34

Mohit Jain · Answer 2 · 2016-04-29T07:59:15.950

10

vector<bool> may have a template specialization and may be implemented using a bit array to save space. Extracting and storing a bit, and converting it from/to bool, may cause the performance drop you are observing. If you use std::vector::push_back, you are resizing the vector, which will cause even worse performance. The next performance killer may be assign (worst-case complexity: linear in the first argument); use operator[] instead (complexity: constant).

On the other hand, bool[] is guaranteed to be an array of bool.

And you should resize to n instead of n-1 to avoid undefined behaviour.
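The difference between the two element types can be made visible in code. The sketch below assumes a typical bit-packed implementation; flip_and_read is a hypothetical helper used only for illustration.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// vector<char> hands out real references to its elements ...
static_assert(std::is_same<std::vector<char>::reference, char&>::value,
              "vector<char>::reference is char&");
// ... while vector<bool> hands out a proxy class instead of bool&.
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool>::reference is not bool&");

// Hypothetical helper: inverts one flag and returns its new value.
bool flip_and_read(std::vector<bool>& v, std::size_t i)
{
    std::vector<bool>::reference bit = v[i];  // proxy object, not bool&
    bit = !bit;   // reads through the proxy, then writes through it
    return v[i];  // converts the proxy back to bool
}
```

In a bit-packed implementation, each such access goes through extra shift and mask work, which is the per-element cost described above.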

edited Apr 29 '16 at 07:59

answered Apr 29 '16 at 07:54

Mohit Jain


9

It not only "may have" such a specialization, this is actually in the standard! – anderas Apr 29 '16 at 07:57
2

One workaround is to use `deque`. Or `vector`. :) – Cheers and hth. - Alf Apr 29 '16 at 07:57
@Mohit Jain: template specialization happens at compile time. It shouldn't affect run-time performance, should it? – guoqing Apr 29 '16 at 09:45
2

@guoqing The implications of the specialization do affect run-time performance (for better or worse). In this case it may be good for space requirements and bad for execution time. – Mohit Jain Apr 29 '16 at 10:38
1

@guoqing You choose between `vector` and `list` before compile time, don't you? Yet it affects runtime performance. – user253751 Apr 29 '16 at 11:49

score 6 · Answer 3 · answered Apr 29 '16 at 18:11

vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.

Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

std::vector<bool>
init_primes()
{
    std::vector<bool> primes(0x80000000, true);
    primes[0] = false;
    primes[1] = false;
    const auto pb = primes.begin();
    const auto pe = primes.end();
    const auto sz = primes.size();
    size_t i = 2;
    while (true)
    {
        size_t j = i*i;
        if (j >= sz)
            break;
        do
        {
            primes[j] = false;
            j += i;
        } while (j < sz);
        i = std::find(pb + (i+1), pe, true) - pb;
    }
    return primes;
}

int
main()
{
    using namespace std::chrono;
    using dsec = duration<double>;
    auto t0 = steady_clock::now();
    auto p = init_primes();
    auto t1 = steady_clock::now();
    std::cout << dsec(t1-t0).count() << "\n";
}

This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.

If you run this using some other std::lib, you probably won't see this trend. In libc++, algorithms such as std::find have been optimized to search a word of bits at a time instead of one bit at a time.

See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.
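As a small sketch of what "operating on many bools at a time" can look like in user code, the helpers below hand the whole range to a standard algorithm instead of indexing bit by bit. The names count_set and next_set are hypothetical, and whether these calls are actually faster depends on how your standard library implements them for vector<bool>.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helpers (names are not from the answer).

// Number of set flags, computed with a single algorithm call.
std::size_t count_set(const std::vector<bool>& flags)
{
    return static_cast<std::size_t>(
        std::count(flags.begin(), flags.end(), true));
}

// Index of the first set flag at or after `from`, or flags.size() if
// there is none. Requires from <= flags.size().
std::size_t next_set(const std::vector<bool>& flags, std::size_t from)
{
    const auto first = flags.begin() + static_cast<std::ptrdiff_t>(from);
    return static_cast<std::size_t>(
        std::find(first, flags.end(), true) - flags.begin());
}
```

next_set mirrors the std::find call in init_primes above, and count_set could be applied to the returned vector to count the primes found.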
