I am trying to create a simulation using the Boost library, but I ran into a problem with asynchronous communication between processes. In our case, there are two processes that send messages to and receive messages from each other (using the isend and irecv commands). If I wait for all send/receive commands to complete, everything works. This is my working code:

boost::mpi::communicator* comm;
// Initialize MPI and etc.

std::vector<boost::mpi::request> sendRequests;
std::vector<boost::mpi::request> receiveRequests;

for(int i=0; i< 10; i++){
    receiveRequests.push_back(comm->irecv(0, 3000, receivedMessage));
    sendRequests.push_back(comm->isend(1, 3000, sentMessage));

    boost::mpi::wait_all(receiveRequests.begin(), receiveRequests.end());
}

However, I want to cancel a receive if it takes too long. So I test whether the communication has completed, using the test and cancel functions. I modified my code as follows:

boost::mpi::communicator* comm;
// Initialize MPI and etc.

std::vector<boost::mpi::request> sendRequests;
std::vector<boost::mpi::request> receiveRequests;

for(int i=0; i< 10; i++){
    receiveRequests.push_back(comm->irecv(0, 3000, receivedMessage));
    sendRequests.push_back(comm->isend(1, 3000, sentMessage));

    vector<boost::mpi::request>::iterator it = receiveRequests.begin();
    while(it != receiveRequests.end()){

Now, my program crashes and I get this error after the first iteration of the loop:

terminate called after throwing an instance of 'std::length_error'
what():  vector::_M_fill_insert
terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
what():  MPI_Test: Message truncated, error stack:
PMPI_Test(168)....................: MPI_Test(request=0x13bba24, flag=0x7fff081a7bd4, status=0x7fff081a7ba0) failed
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 3000 truncated; 670 bytes received but buffer size is 577

So, I'd like to know how to resolve this error.

  • There isn't enough code here to reproduce the error, so all you are going to get is guesses. We can say that, as per the error message, the issue isn't cancelling, it's testing. In particular, it looks like you are posting a number of receives (from different ranks? who knows?) with the same tag (I guess) but different lengths (someone sent 670 bytes long, but you were expecting something 577 bytes long). So when the test occurs, that receive is attempted and fails. For some reason, in the original code, perhaps due to the increased synchronization (who can say?) that didn't happen. – Jonathan Dursi Dec 04 '14 at 17:15
  • I tried sending requests with different tags, but the result is the same. At every iteration of the loop, the processes send each other messages with different tags, but it still did not work. As you said, the error seems to be in the test function. Even if I do not cancel the request (but only test it), I still get the same error. I would like to provide more code, but I do not know what else to include, because simply using test instead of wait seems to cause the error. The rest of my code is unrelated to the MPI part. – montekristo_07 Dec 05 '14 at 14:40

2 Answers


Where does the iterator it come from? It is not declared anywhere in the code shown.

Note that push_back could reallocate and this invalidates any pending iterators.

Also note that you must increment it only when you did not erase the element, because erase already returns an iterator to the next element. The typical pattern is

 it = receiveRequests.erase(it);

Update: I see you have added information to the question. It should probably be:

vector<boost::mpi::request>::iterator it = receiveRequests.begin();
while(it != receiveRequests.end()){
    it = receiveRequests.erase(it);
}

I'm not sure why you always erase every receive request. I'm assuming that's the intent.
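The erase pattern above can be sketched without MPI. This is only an illustration: FakeRequest and remove_completed are hypothetical names that are not part of Boost.MPI, and the done flag stands in for a request whose test reported completion.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for boost::mpi::request; 'done' plays the role of
// a request whose test() reported completion.
struct FakeRequest {
    int id;
    bool done;
};

// Erases every completed request and returns how many were erased.
// erase() returns an iterator to the element after the erased one, so the
// iterator is advanced by hand only when nothing was erased.
std::size_t remove_completed(std::vector<FakeRequest>& requests) {
    std::size_t removed = 0;
    std::vector<FakeRequest>::iterator it = requests.begin();
    while (it != requests.end()) {
        if (it->done) {
            it = requests.erase(it);  // 'it' now refers to the next element
            ++removed;
        } else {
            ++it;                     // keep this request and move on
        }
    }
    return removed;
}
```

Calling remove_completed on a vector leaves only the entries whose done flag is false, in their original order.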

  • Thanks for your answer. I corrected the code above and handled the iterator as you said. Now it behaves non-deterministically. Sometimes I get the same error in the first iteration, sometimes in the third, and so on. And sometimes I never get the error and the program terminates as expected. – montekristo_07 Dec 04 '14 at 14:59
  • Do you increment the iterator elsewhere? It should **not** be incremented if you deleted the item. Also, have you made sure the vector cannot reallocate (see **[iterator invalidation rules](http://stackoverflow.com/questions/6438086/iterator-invalidation-rules)**). – sehe Dec 04 '14 at 15:00
  • Updated my answer since you've added more information to the question – sehe Dec 04 '14 at 15:04
  • Actually, you're right, I do not want to erase all the receive requests; that is only for testing, so I erase them just for now. I updated my code as you suggested and do not touch the iterator anywhere else. – montekristo_07 Dec 04 '14 at 15:06
  • Then you have another problem somewhere else. Good luck debugging and consider posting another question if you get stuck – sehe Dec 04 '14 at 15:08
  • Actually, I wonder whether there might be another problem with the test and cancel methods. What if the program runs test first, the communication then completes, and only afterwards the program tries to run cancel? I am asking because I am not very familiar with MPI. – montekristo_07 Dec 04 '14 at 15:17
  • Probably something for the dev mailing list: http://lists.mcs.anl.gov/pipermail/mpich-discuss/2006-February/001173.html: _"If the request is completed by a test or wait, it is set to MPI_REQUEST_NULL. See if adding an "if (request != MPI_REQUEST_NULL)" around the MPI_Cancel helps."_ – sehe Dec 04 '14 at 15:20

Finally, I figured it out. It was caused by a race condition between the test and cancel methods. Since there are hundreds of message requests at run time, this situation occasionally occurs: after a request is tested, the program cannot cancel it, because it completed in the meantime (after the call to test but before the call to cancel). That is why the error occurs irregularly. So I had to change my approach and remove the call to cancel.
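For anyone who still needs to bound the waiting time without calling cancel, one possible approach is to poll for completion until a deadline. The following is a minimal sketch, not the code I ended up with: poll_until is a hypothetical helper, and with Boost.MPI the predicate would presumably wrap a call to the request's test method, which returns an optional status. What to do with a request that is still pending after the deadline is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of Boost.MPI): calls is_done() repeatedly
// until it returns true or the timeout has elapsed.
// Returns true if is_done() succeeded before the deadline, false otherwise.
template <typename Predicate>
bool poll_until(Predicate is_done,
                std::chrono::milliseconds timeout,
                std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
    assert(timeout.count() >= 0);
    const std::chrono::steady_clock::time_point deadline =
        std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (is_done()) {
            return true;   // completed in time
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // gave up; nothing is cancelled here
        }
        std::this_thread::sleep_for(interval);  // avoid a busy loop
    }
}
```

The predicate is checked at least once, even with a zero timeout, so a request that has already completed is never reported as timed out.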
