
I'm working with C++ in Visual Studio 2010. I have an STL set, which I'm saving to file when my program shuts down. The next time the program starts up, I load the (sorted) data back into a set. I'm trying to optimize the loading process, and I'm having trouble. I suspect the problem is with frequent re-balancing, and I'm looking for a way to avoid that.

First, I did it with no optimization, using the plain "insert(const value_type& x)" overload.

Time: ~5.5 minutes

Then I tried using the version of insert() where you pass in a hint for the location of the insert():

iterator insert(iterator position, const value_type& x);

Roughly, I did this:

set<int> My_Set;
set<int>::iterator It;
It = My_Set.insert(0).first;     // insert(value) returns a pair<iterator, bool>
for (int I = 1; I < 1000; I++) {
    It = My_Set.insert(It, I);   // remember the previous insertion's iterator
}

Time: ~5.4 minutes

Barely any improvement! I don't think the problem is overhead in reading from the file; commenting out the insert() reduces the time to 2 seconds. Nor do I think it's overhead in copying my object; it's a Plain Old Data object with an int and a char.
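
For reference, the object looks roughly like this (the names here are illustrative, not the real ones, and I'm assuming the ordering is on the int):

struct Record {          // illustrative name
    int  Key;            // the set is ordered on this field
    char Flag;

    bool operator<(const Record& Other) const {
        return Key < Other.Key;   // strict weak ordering required by std::set
    }
};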

The only thing I can think of is that the set is constantly re-balancing.

1.) Do you agree with my guess?

2.) Is there a way to "pause" the rebalancing while I load the set, and then rebalance once at the end? (Or... Would that even help?)

3.) Is there a smarter way to load the sorted data, i.e. not simply moving from lowest to highest? Perhaps alternating my insertions so that it doesn't have to balance often? (Example: Insert 1, 1000, 2, 999, 3, 998,...)
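
To illustrate question 3, the alternating order I have in mind would be generated roughly like this (just a sketch with plain ints and the 1..1000 bounds from the example above; the function name is made up):

#include <set>

std::set<int> Build_Alternating() {
    std::set<int> My_Set;
    int Low = 1, High = 1000;
    while (Low < High) {
        My_Set.insert(Low++);    // take one from the bottom...
        My_Set.insert(High--);   // ...then one from the top
    }
    if (Low == High) {
        My_Set.insert(Low);      // odd count: one middle element left over
    }
    return My_Set;
}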

Jugulum

3 Answers


About how many elements are we talking?

I made a short test with 10,000,000 integers (prepared in a vector) and inserted them into a set in three different ways.

Prepare Input:

  std::vector<int> input;
  for(int i = 0; i < 10*1000*1000; ++i) {
     input.push_back(i);
  }


Insert into set item by item with insert:

Release: 2.4 seconds / Debug: 110.8 seconds

  std::set<int> mySet;
  std::for_each(input.cbegin(), input.cend(), [&mySet] (int value) {
     mySet.insert(value);
  });


Insert into set with insert(itBegin, itEnd):

Release: 0.9 seconds / Debug: 47.5 seconds

  std::set<int> mySet;
  mySet.insert(input.cbegin(), input.cend());

  // this is also possible - same execution time:
  std::set<int> mySet(input.cbegin(), input.cend());

So insertion can be sped up considerably, but even the slow way should be far from several minutes.
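
For completeness, here is a self-contained sketch of the whole test, timed with clock() from <ctime> (this harness is my own; it assumes a lambda-capable compiler such as VS2010):

  #include <algorithm>
  #include <ctime>
  #include <iostream>
  #include <set>
  #include <vector>

  int main() {
     // Prepare 10,000,000 presorted integers.
     std::vector<int> input;
     for(int i = 0; i < 10*1000*1000; ++i) {
        input.push_back(i);
     }

     // Variant 1: insert item by item (each insert searches from the root).
     {
        std::clock_t start = std::clock();
        std::set<int> mySet;
        std::for_each(input.begin(), input.end(), [&mySet] (int value) {
           mySet.insert(value);
        });
        std::cout << "item by item: "
                  << double(std::clock() - start) / CLOCKS_PER_SEC << " s\n";
     }

     // Variant 2: one range insert (can exploit the presorted order).
     {
        std::clock_t start = std::clock();
        std::set<int> mySet;
        mySet.insert(input.begin(), input.end());
        std::cout << "range insert: "
                  << double(std::clock() - start) / CLOCKS_PER_SEC << " s\n";
     }
  }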


EDIT:

Meanwhile I ran the test in debug mode as well - wow - I knew debug builds cost performance, but it is more than I thought. With 50,000,000 elements there is a bad_alloc in debug mode, so I updated my post to 10,000,000 elements and show the times for both release and debug builds.

You can see the immense difference here - about a factor of 50 between debug and release with the faster solution.

Additionally, the fast solution (insert(itBegin, itEnd)) seems to be linear in the number of elements (with presorted data!). The previous test had five times as many elements, and the insert time dropped from 4.6 to 0.9 seconds - about a factor of five.

MacGucky
  • Thanks, I'll have to try it in release mode tomorrow. (I'm waiting for someone else on the project to fix a compiler error that only shows up in release mode.) In debug mode, I'm getting the following times: 1.) Set.insert(Val) - 334 sec; 2.) Prev_Iter = Set.insert(Prev_Iter, Val) - 339 sec; 3.) Set.insert(Set.end(), Val) - 329 sec; 4.) push_back() everything into a vector, then Set.insert(Vect.begin(), Vect.end()) - 347 sec. That data is very different from yours, and it makes no sense; there has to be something going on that's related to debug mode. – Jugulum Mar 23 '11 at 22:55

Have you tried building the set from the whole input range at once?

#include <set>
#include <fstream>
#include <algorithm>
#include <iterator>

int main()
{
    std::ifstream  file("Plop");

    std::set<int>   myset;

    std::copy(std::istream_iterator<int>(file),
              std::istream_iterator<int>(),
              std::inserter(myset, myset.end()));
}

I tried 4 techniques with the items [0, 10,000,000) (sorted in the file):

void t1(std::set<int>& data, std::istream& file)
{
    int x;
    while(file >> x)    {data.insert(x); }
}

void t2(std::set<int>& data, std::istream& file)
{
    int x;
    while(file >> x)    {data.insert(data.end(), x);}
}

void t3(std::set<int>& data, std::istream& file)
{
    std::set<int>::iterator it = data.begin();
    int x;
    while(file >> x)    {it = data.insert(it, x);}
}

void t4(std::set<int>& data, std::istream& file)
{
    std::copy(std::istream_iterator<int>(file),
              std::istream_iterator<int>(), 
              std::inserter(data, data.end()));
}
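
The timing driver is not shown; a minimal sketch of how t1..t4 could be driven (this harness is an assumption, and it relies on the includes and functions above; the raw clock() ticks are what the table below reports):

#include <ctime>
#include <iostream>

int main()
{
    // Table of the four test functions defined above.
    void (*tests[])(std::set<int>&, std::istream&) = {t1, t2, t3, t4};

    for (int i = 0; i < 4; ++i)
    {
        std::ifstream file("Plop");   // same sorted data file as above
        std::set<int> data;

        std::clock_t start = std::clock();
        tests[i](data, file);
        std::cout << "t" << (i + 1) << " Result: "
                  << (std::clock() - start) << "\n";
    }
}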

Times are raw clock() ticks, averaged over 3 runs (normal) and 3 runs (-O4):

                    Plain Data
           Normal              -O4
           =========           ========= 
t1 Result: 21057300            6748061
t2 Result:  6580081            4752549
t3 Result:  6675929            4786003
t4 Result:  8452749            6460603

Conclusion 1: for sorted data:

Best:   data.insert(data.end(), <item>)  // Hint end()
Worst:  data.insert(<item>);             // No Hint

Conclusion 2: Optimization counts.

Martin York

It's possible the set is rebalancing. How many items do you REALLY have that take 5.5 minutes? If your set of items is big enough you might be hitting physical RAM limits and thrashing, or just having really bad cache misses.

There's definitely no way to disable the rebalancing. If you could, then the set would be able to break its invariants which would be bad.

  • Get a profiler and profile your code rather than guess what's taking the time.
  • Did you try the two param insert using end instead of the previous iterator as another data point?
  • Did you try inserting into a pre-reserved vector instead to compare the time?
  • Can you get away with another container type like heap or (sorted) vector?
  • If you can quickly load into a vector, do that, then random_shuffle it, and then try inserting into the set again and see what happens (a sketch of this experiment follows this list).
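
A rough sketch of that last experiment (std::random_shuffle is the pre-C++17 way to shuffle; the file name is an assumption):

#include <algorithm>
#include <fstream>
#include <iterator>
#include <set>
#include <vector>

int main()
{
    std::ifstream file("data.txt");   // hypothetical file, one int per line

    // Load quickly into a vector first (the extra parentheses avoid the
    // most-vexing-parse).
    std::vector<int> values((std::istream_iterator<int>(file)),
                            std::istream_iterator<int>());

    // Destroy the sorted order, then build the set from the shuffled data.
    std::random_shuffle(values.begin(), values.end());
    std::set<int> data(values.begin(), values.end());

    // Compare this build time against the one with sorted input.
}
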
Mark B
  • I think I need a set, because I have lots of look-ups happening. A sorted vector is a possibility (doing a binary search on it), but I may have to do on-the-fly insertions, too. So if I can fix this hang-up in the initial load-up, a set seems preferable. – Jugulum Mar 23 '11 at 23:05
  • On the other matters: Two-param insert using end() was about the same, as was inserting into a pre-reserved vector followed by insert(Vect.begin(), Vect.end()). I'll try the random_shuffle. – Jugulum Mar 23 '11 at 23:09