28

I have an application at $work where I have to move between two real-time threads that are scheduled at different frequencies. (The actual scheduling is beyond my control.) The application is hard real-time-ish (one of the threads has to drive a hardware interface), so the data transfer between the threads should be lock-free and wait-free to the extent possible.

It is important to note that only one block of data needs to be transferred: because the two threads run at different rates, there will be times when two iterations of the faster thread are completed between two wakeups of the slower thread; in this case it is OK to overwrite the data in the write buffer so that the slower thread gets only the latest data.

In other words, instead of a queue, a double-buffered solution suffices. The two buffers are allocated during initialization, and the reader and writer threads can call methods of the class to get pointers to one of these buffers.

C++ code:

#include <mutex>

template <typename T>
class ProducerConsumerDoubleBuffer {
public:
    ProducerConsumerDoubleBuffer() {
        m_write_busy = false;
        m_read_idx = m_write_idx = 0;
    }

    ~ProducerConsumerDoubleBuffer() { }

    // The writer thread using this class must call
    // start_writing() at the start of its iteration
    // before doing anything else to get the pointer
    // to the current write buffer.
    T * start_writing(void) {
        std::lock_guard<std::mutex> lock(m_mutex);

        m_write_busy = true;
        m_write_idx = 1 - m_read_idx;

        return &m_buf[m_write_idx];
    }
    // The writer thread must call end_writing()
    // as the last thing it does
    // to release the write busy flag.
    void end_writing(void) {
        std::lock_guard<std::mutex> lock(m_mutex);

        m_write_busy = false;
    }

    // The reader thread must call start_reading()
    // at the start of its iteration to get the pointer
    // to the current read buffer.
    // If the write thread is not active at this time,
    // the read buffer pointer will be set to the 
    // (previous) write buffer - so the reader gets the latest data.
    // If the write buffer is busy, the read pointer is not changed.
    // In this case the read buffer may contain stale data,
    // it is up to the user to deal with this case.
    T * start_reading(void) {
        std::lock_guard<std::mutex> lock(m_mutex);

        if (!m_write_busy) {
            m_read_idx = m_write_idx;
        }

        return &m_buf[m_read_idx];
    }
    // The reader thread must call end_reading()
    // at the end of its iteration.
    void end_reading(void) {
        std::lock_guard<std::mutex> lock(m_mutex);

        m_read_idx = m_write_idx;
    }

private:
    T m_buf[2];
    bool m_write_busy;
    unsigned int m_read_idx, m_write_idx;
    std::mutex m_mutex;
};

To avoid stale data in the reader thread the payload structure is versioned. To facilitate bidirectional data transfer between the threads, two instances of the above monstrosity are used, in opposite directions.
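
For illustration, the versioned payload and the bidirectional setup look roughly like this (a simplified sketch - the actual payload layout is application-specific):

#include <cstdint>

struct Payload {
    std::uint64_t version;   // iteration count of the writer that produced this block
    // ... application data ...
};

// One instance of the double buffer per direction:
ProducerConsumerDoubleBuffer<Payload> fast_to_slow;
ProducerConsumerDoubleBuffer<Payload> slow_to_fast;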

Questions:

  • Is this scheme thread-safe? If it's broken, where?
  • Can it be done without the mutex? Perhaps with just memory barriers or CAS instructions?
  • Can it be made better?
user3638506
  • I like this question. I'll look at it after work and get back to you sometime tomorrow. (I've implemented other lock-free data structures and I'm almost certain this can be made blazing fast without much effort.) – Cameron May 14 '14 at 22:20
  • I like this question too. Seems to be correct so far. Has interesting constraints, maybe allows really fast implementation. I would have a use for this too. But I am not sure yet about a good (better) implementation. – Frunsi May 15 '14 at 00:32
  • Why not use a *circular buffer*? I guess it would simplify the issue of always getting the latest data. – Rakib May 15 '14 at 02:10
  • Alright... this is more complex than I thought :-) I'm working on an implementation, will keep you posted. – Cameron May 16 '14 at 02:46
  • what do you want the read thread to do if there's nothing to read? – billmcc May 16 '14 at 15:26
  • @Cameron: thanks for taking interest in my problem! (BTW I've seen your lock-free queue implementation, so I'm interested in your take on this) – user3638506 May 16 '14 at 19:16
  • @billmcc: this is in fact an expected situation in the case when the slower thread sends data to the faster one. The payload data structure has a member that acts as a version number - essentially the iteration count of the writer thread that put in that particular block of data. The reader checks this number, compares it with the one it last dealt with, and acts accordingly. – user3638506 May 16 '14 at 19:26
  • Have a look at this, I threw this together really quickly for a side project. It's "lock free" but relies on atomic operations. https://github.com/chadkler/hipoconcon/blob/master/inc/ringbuffer.h – Chad May 17 '14 at 19:06
  • 1
    @Chad: You'll get better performance (much better on x86) if you use explicit acquire and release memory barriers instead of the default sequentially consistent ones. Also, the `%` will slow things down -- consider forcing the size to a power-of-two so you can use `&` instead. [Facebook's `folly` queue](https://github.com/facebook/folly/blob/master/folly/ProducerConsumerQueue.h), by the way, has the same semantics but all of the improvements I've mentioned (they use an `if` instead of `%` and rely on branch prediction) :-) – Cameron May 17 '14 at 23:22

3 Answers

12

Very interesting problem! Way trickier than I first thought :-) I like lock-free solutions, so I've tried to work one out below.

There are many ways to think about this system. You can model it as a fixed-size circular buffer/queue (with two entries), but then you lose the ability to update the next available value for consumption, since you don't know if the consumer has started to read the most recently published value or is still (potentially) reading the previous one. So extra state is needed beyond that of a standard ring buffer in order to reach a more optimal solution.

First note that there is always a cell that the producer can safely write to at any given point in time; if one cell is being read by the consumer, the other can be written to. Let's call the cell that can be safely written to the "active" cell (the cell that can be potentially read from is whatever cell isn't the active one). The active cell can only be switched if the other cell is not currently being read from.

Unlike the active cell, which can always be written to, the non-active cell can only be read from if it contains a value; once that value is consumed, it's gone. (This means that livelock is avoided in the case of an aggressive producer; at some point, the consumer will have emptied a cell and will stop touching the cells. Once that happens, the producer can definitely publish a value, whereas before that point, it can only publish a value (change the active cell) if the consumer is not in the middle of a read.)

If there is a value that's ready to be consumed, only the consumer can change that fact (for the non-active cell, anyway); subsequent productions may change which cell is active and the published value, but a value will always be ready to be read until it's consumed.

Once the producer is done writing to the active cell, it can "publish" this value by changing which cell is the active one (swapping the index), provided the consumer is not in the middle of reading the other cell. If the consumer is in the middle of reading the other cell, the swap cannot occur, but in that case the consumer can swap after it's done reading the value, provided the producer is not in the middle of a write (and if it is, the producer will swap once it's done). In fact, in general the consumer can always swap after it's done reading (if it's the only one accessing the system) because spurious swaps by the consumer are benign: if there is something in the other cell, then swapping will cause that to be read next, and if there isn't, swapping affects nothing.

So, we need a shared variable to track what the active cell is, and we also need a way for both the producer and consumer to indicate if they're in the middle of an operation. We can store these three pieces of state into one atomic variable in order to be able to affect them all at once (atomically). We also need a way for the consumer to check if there's anything in the non-active cell in the first place, and for both threads to modify that state as appropriate. I tried a few other approaches, but in the end the easiest was just to include this information in the other atomic variable too. This makes things much simpler to reason about, since all state changes in the system are atomic this way.

I've come up with a wait-free implementation (lock-free, and all operations complete in a bounded number of instructions).

Code time!

#include <atomic>
#include <cstdint>

template <typename T>
class ProducerConsumerDoubleBuffer {
public:
    ProducerConsumerDoubleBuffer() : m_state(0) { }
    ~ProducerConsumerDoubleBuffer() { }

    // Never returns nullptr
    T* start_writing() {
        // Increment active users; once we do this, no one
        // can swap the active cell on us until we're done
        auto state = m_state.fetch_add(0x2, std::memory_order_relaxed);
        return &m_buf[state & 1];
    }

    void end_writing() {
        // We want to swap the active cell, but only if we were the last
        // ones concurrently accessing the data (otherwise the consumer
        // will do it for us when *it's* done accessing the data)

        auto state = m_state.load(std::memory_order_relaxed);
        std::uint32_t flag = (8 << (state & 1)) ^ (state & (8 << (state & 1)));
        state = m_state.fetch_add(flag - 0x2, std::memory_order_release) + flag - 0x2;
        if ((state & 0x6) == 0) {
            // The consumer wasn't in the middle of a read, we should
            // swap (unless the consumer has since started a read or
            // already swapped or read a value and is about to swap).
            // If we swap, we also want to clear the full flag on what
            // will become the active cell, otherwise the consumer could
            // eventually read two values out of order (it reads a new
            // value, then swaps and reads the old value while the
            // producer is idle).
            m_state.compare_exchange_strong(state, (state ^ 0x1) & ~(0x10 >> (state & 1)), std::memory_order_release);
        }
    }

    // Returns nullptr if there appears to be no more data to read yet
    T* start_reading() {
        m_readState = m_state.load(std::memory_order_relaxed);
        if ((m_readState & (0x10 >> (m_readState & 1))) == 0) {
            // Nothing to read here!
            return nullptr;
        }

        // At this point, there is guaranteed to be something to
        // read, because the full flag is never turned off by the
        // producer thread once it's on; the only thing that could
        // happen is that the active cell changes, but that can
        // only happen after the producer wrote a value into it,
        // in which case there's still a value to read, just in a
        // different cell.

        m_readState = m_state.fetch_add(0x2, std::memory_order_acquire) + 0x2;

        // Now that we've incremented the user count, nobody can swap until
        // we decrement it
        return &m_buf[(m_readState & 1) ^ 1];
    }

    void end_reading() {
        if ((m_readState & (0x10 >> (m_readState & 1))) == 0) {
            // There was nothing to read; shame to repeat this
            // check, but if these functions are inlined it might
            // not matter. Otherwise the API could be changed.
            // Or just don't call this method if start_reading()
            // returns nullptr -- then you could also get rid
            // of m_readState.
            return;
        }

        // Alright, at this point the active cell cannot change on
        // us, but the active cell's flag could change and the user
        // count could change. We want to release our user count
        // and remove the flag on the value we read.

        auto state = m_state.load(std::memory_order_relaxed);
        std::uint32_t sub = (0x10 >> (state & 1)) | 0x2;
        state = m_state.fetch_sub(sub, std::memory_order_relaxed) - sub;
        if ((state & 0x6) == 0 && (state & (0x8 << (state & 1))) != 0) {
            // Oi, we were the last ones accessing the data when we released our cell.
            // That means we should swap, but only if the producer isn't in the middle
            // of producing something, and hasn't already swapped, and hasn't already
            // set the flag we just reset (which would mean they swapped an even number
            // of times).  Note that we don't bother swapping if there's nothing to read
            // in the other cell.
            m_state.compare_exchange_strong(state, state ^ 0x1, std::memory_order_relaxed);
        }
    }

private:
    T m_buf[2];

    // The bottom (lowest) bit will be the active cell (the one for writing).
    // The active cell can only be switched if there's at most one concurrent
    // user. The next two bits of state will be the number of concurrent users.
    // The fourth bit indicates if there's a value available for reading
    // in m_buf[0], and the fifth bit has the same meaning but for m_buf[1].
    std::atomic<std::uint32_t> m_state;

    std::uint32_t m_readState;
};

Note that the semantics are such that the consumer can never read a given value twice, and a value it does read is always newer than the last value it read. It's also fairly efficient in memory usage (two buffers, like your original solution). I avoided CAS loops because they're generally less efficient than a single atomic operation under contention.
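
To make the bit twiddling easier to follow, here is the layout of m_state written out as named masks (the names are purely illustrative and don't appear in the code above):

// Illustrative constants describing the m_state bit layout:
constexpr std::uint32_t ACTIVE_CELL_MASK = 0x01;  // bit 0: index of the cell currently used for writing
constexpr std::uint32_t USER_COUNT_UNIT  = 0x02;  // bits 1-2: number of concurrent users,
constexpr std::uint32_t USER_COUNT_MASK  = 0x06;  //           incremented/decremented in units of 0x2
constexpr std::uint32_t FULL_FLAG_BUF0   = 0x08;  // bit 3: m_buf[0] holds an unread value
constexpr std::uint32_t FULL_FLAG_BUF1   = 0x10;  // bit 4: m_buf[1] holds an unread value
// Full flag of the active (write) cell:      0x8 << (state & 1)
// Full flag of the non-active (read) cell:   0x10 >> (state & 1)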

If you decide to use the above code, I suggest you write some comprehensive (threaded) unit tests for it first. And proper benchmarks. I did test it, but only just barely. Let me know if you find any bugs :-)

My unit test:

#include <cassert>
#include <thread>

ProducerConsumerDoubleBuffer<int> buf;
std::thread producer([&]() {
    for (int i = 0; i != 500000; ++i) {
        int* item = buf.start_writing();
        if (item != nullptr) {      // Always true
            *item = i;
        }
        buf.end_writing();
    }
});
std::thread consumer([&]() {
    int prev = -1;
    for (int i = 0; i != 500000; ++i) {
        int* item = buf.start_reading();
        if (item != nullptr) {
            assert(*item > prev);
            prev = *item;
        }
        buf.end_reading();
    }
});
producer.join();
consumer.join();

As for your original implementation, I only looked at it cursorily (it's much more fun to design new stuff, heh), but david.pfx's answer seems to address that part of your question.

Cameron
  • Thanks! Nice work! Unfortunately, it is desirable in my application that the reader _can_ read the same value twice (for the case where the slower thread sends data to the faster one). I would have to rethink the implementation of the reader part of the fast thread if I were to use this solution. Additionally, I find the bit magic to handle internal state hard to understand and fragile. E.g. repeated calls to any of the four methods will mess up the state (I don't know if this constitutes a bug, but it's at least a weak side of the implementation). – user3638506 May 17 '14 at 17:55
  • Also, it seems to me that the entire state change is not atomic in some cases: you fetch the value of m_state, do some magic with it, then store it back a few instructions later. – user3638506 May 17 '14 at 17:56
  • 1
    @user: I implemented the strongest design constraints. If you want the reader to keep reading the old value until a new one is published, simply add a cache item and set it after a successful read of a fresh value; then when a value is not available in `begin_reading`, return a pointer to that value instead of nullptr (provided it's been set in the first place, of course). No rethinking needed ;-) – Cameron May 17 '14 at 18:40
  • @user: I agree that the bit manipulation is a little hard to follow, but it is correct as far as I can tell. I'm not sure what you mean by "repeated calls", but surely any call to begin_x should be matched with exactly one call to end_x? As for the state changes, each change itself is atomic, even if the full state change is spread across several atomic updates. This works because of the guarantee about no swapping being done while the user count is above zero. At any single point in time, the state as a whole is valid. I'll add my unit test for your perusal. – Cameron May 17 '14 at 18:43
  • @user: Yes, the bit manipulation is fragile. But *any* lock-free data structure is extremely fragile to start with -- not one line can be thought about without thinking of the algorithm as a whole -- there is no such thing as a trivial edit when multiple threads are involved :-) (Unless, of course, you're editing a part of it that doesn't relate to the synchronization, such as adding the cached read value to make a value always available for reading once the first one has been read.) – Cameron May 17 '14 at 18:53
  • @user: Check out my edit, I managed to remove one atomic operation (makes writing even faster). – Cameron May 18 '14 at 05:26
  • The "+ flag - 0x2" and similar expressions at the end of some of the atomic operations shouldn't be there, right? – user3638506 May 18 '14 at 09:29
  • Disregard my previous comment, now I see that the atomic ops return the previous value. – user3638506 May 18 '14 at 16:17
  • @user: Interestingly, you can trade speed for throughput: the relaxed load at the beginning of `start_reading` is fast, but it won't always read the most recent value (meaning it might think there's nothing to read for a little while even once there is); you can replace it with an atomic RMW operation to force the most recent value to load: `m_state.compare_exchange_weak(m_readState, m_readState, std::memory_order_relaxed);`. This results in about twice the throughput (# of individual writes read) but at half the total speed (due to increased contention). – Cameron May 19 '14 at 03:31
  • ...ignore my last comment. The throughput is the same in both cases, because while the number of items transferred from one thread to another doubles, the time doubles also, resulting in the same # of items per unit time (throughput). Not sure what I was thinking :-) – Cameron May 20 '14 at 14:55
4

Yes, I think it's broken.

If the reader does a start/end/start in succession it will update its read index to the write index, and potentially read data from the write index, even if the write is busy.

The problem essentially is that the writer doesn't know which buffer the reader will use, so the writer should ensure that both buffers are valid at all times. It can't do that, if it's going to take any time to write data into a buffer [unless I misunderstood some of the logic that isn't shown here.]

Yes, I think it can be done without locks, using CAS or equivalent logic. I'm not going to try to express an algorithm in this space. I'm confident that it exists, but not that I can write it out correctly first time. And a bit of web searching turned up some plausible candidates. Wait-free IPC using CAS appears to be quite an interesting topic and the subject of some research.


After some further thought, the algorithm is as follows. You need:

  • 3 buffers: one for the writer, one for the reader to use, and one extra. The buffers are ordered: they form a ring (but see note).
  • A status for each buffer: free, full, writing, reading.
  • A function that can inspect the status of the buffer and conditionally change the status to a different value in a single atomic operation. I shall use CSET for that.

Writer:

Find the first buffer that is FREE or FULL
    Fail: assert (should never fail, reader can only use one buffer)
    CSET buffer to WRITING
Write into the buffer
CSET buffer to FULL

Reader:

Find first buffer that is FULL
    Fail: wait (writer may be slow)
    CSET buffer to READING
Read and consume buffer
CSET buffer to FREE

Note: This algorithm does not guarantee that buffers are treated strictly in order of arrival, and no simple change will make it do so. If this is important, the algorithm should be enhanced with a sequence number on the buffer, set by the writer so that the most recent buffer can be chosen by the reader.

I leave the code as an implementation detail.


The CSET function is non-trivial. It has to atomically test that a particular shared memory location is equal to an expected value and if so change it to a new value. It returns true if it successfully made the change and false otherwise. The implementation must avoid race conditions if two threads access the same location at the same time (and possibly on different processors).

The C++ standard atomic operations library contains a set of atomic_compare_exchange functions which should serve the purpose, if available.
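
For concreteness, here is a minimal sketch of CSET on top of std::atomic (the enum, the function name and the usage outline are illustrative only, not a tested implementation):

#include <atomic>

enum class BufState { FREE, WRITING, FULL, READING };

// Atomically: if state == expected, change it to desired and return true; otherwise return false.
bool cset(std::atomic<BufState> &state, BufState expected, BufState desired)
{
    return state.compare_exchange_strong(expected, desired);
}

// Writer, following the pseudocode above: scan the ring for the first buffer where
//     cset(buf.state, BufState::FREE, BufState::WRITING) ||
//     cset(buf.state, BufState::FULL, BufState::WRITING)
// succeeds, write into it, then
//     cset(buf.state, BufState::WRITING, BufState::FULL);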

david.pfx
  • Ahh, you're right. The problem of a start/end/start sequence exposing the write buffer to the writer can be patched by adding a check of the busy flag in end_writing(), but that leads to a different issue: it will be possible for certain combinations of timing that the reader thread never gets any new data because it always happens to be called when the writer is busy. – user3638506 May 16 '14 at 19:05
  • I think your 4 buffer states (FREE, WRITING, FULL, READING), with only the state changes being atomic, are **the** solution. But IMHO only 2 buffers should be used, and if you want to be sure that sequencing is respected, it is sufficient to slightly modify the "end write": when setting a buffer to FULL, if the other is also FULL, it should be set to FREE. That guarantees that only one buffer can be FULL at a given time, and is consistent with the assumption that only one buffer should be processed at a time. – Serge Ballesta May 17 '14 at 08:36
  • @SergeBallesta: The requirement was to minimise waiting. With 3 buffers the writer never waits and the reader never waits if there is data. – david.pfx May 17 '14 at 08:50
  • @david.pfx: You are right, but with 2 buffers the writer never waits: if one buffer is READING and the other is FULL, re-use the FULL one. I acknowledge that the reader might never see a buffer it could process, but only if the writer IS writing - so the buffer already contains old data - and the reader has just finished reading. – Serge Ballesta May 17 '14 at 09:44
  • It seems that I indeed need 3 buffers if I want to guarantee that the reader always has a buffer regardless of whether the writer is active. However, are you sure that it is enough for just the "set state" operation to be atomic? E.g. it's possible that the reader process hijacks a FULL buffer between the writer finding it and setting its state to WRITING. – user3638506 May 17 '14 at 15:15
  • @user3638506: The CSET function is required to conditionally test and set the state as a single atomic operation. If the CSET for the writer succeeds, the CSET for the reader will not. Once you have this function, all else follows. – david.pfx May 17 '14 at 15:39
0

Here is a version using InterlockedExchangePointer() and SLISTs.

This solution does not support re-reading of the last buffer. If that is needed, it can be done on the reader side by means of a copy and an if (NULL == doubleBuffer.beginRead()) { use backup copy ... } - roughly as sketched below.
This was left out not because it is hard to add, but because it is not very realistic. Imagine your last known value getting older and older - seconds, days, weeks. It is unlikely that the application would still want to use it. So, factoring re-read functionality into the double buffer code takes away flexibility from the application.
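
If the fallback is wanted anyway, a rough reader-side sketch could look like this (illustrative only; backup and haveBackup would live across reader iterations):

// Reader-side fallback, kept entirely outside the double buffer class:
DataItem_t backup;                 // last value the reader has seen
bool haveBackup = false;

DataItem_t *readValue = doubleBuffer.beginRead();
if (NULL != readValue)
{
    backup = *readValue;           // remember the latest value
    haveBackup = true;
    // ... use *readValue ...
}
else if (haveBackup)
{
    // ... use the (possibly stale) backup copy ...
}
doubleBuffer.endRead(readValue);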

The double buffer has 1 read pointer member. Whenever beginRead() is called, this value is returned and atomically replaced with NULL. Think of it as "The reader TAKES the buffer."
With endRead(), the reader returns the buffer and it is added to the SLIST, containing the available buffers for write operations.

Initially, both buffers are added to the SLIST, the read pointer is NULL.

beginWrite() pops the next available buffer from the SLIST. And this value cannot ever be NULL, because of the way endWrite() is implemented.

Last but not least, endWrite() atomically swaps the read pointer with the returned, freshly written buffer, and if the old read pointer was not NULL, pushes it onto the SLIST.

So, even if the reader side never reads, the writer side never runs out of buffers. When the reader reads, it gets the latest known value (once!).

This implementation is not safe against multiple concurrent readers or writers. But that was not the objective in the first place.

On the ugly side, the buffers need to be structures with an SLIST_ENTRY member on top.

Here is the code, but keep in mind that it is not my fault if your Mars rover lands on Venus!

const size_t MAX_DATA_SIZE = 512;
typedef
//__declspec(align(MEMORY_ALLOCATION_ALIGNMENT))
struct DataItem_tag
{
    SLIST_ENTRY listNode;
    uint8_t data[MAX_DATA_SIZE];
    size_t length;
} DataItem_t;

class CDoubleBuffer
{
    SLIST_HEADER m_writePointers;
    DataItem_t m_buffers[2];
    volatile DataItem_t *m_readPointer;

public:
    CDoubleBuffer()
        : m_writePointers()
        , m_buffers()
        , m_readPointer(NULL)
    {
        InitializeSListHead(&m_writePointers);
        InterlockedPushEntrySList(&m_writePointers, &m_buffers[0].listNode);
        InterlockedPushEntrySList(&m_writePointers, &m_buffers[1].listNode);
    }
    DataItem_t *beginRead()
    {
        DataItem_t *result = reinterpret_cast<DataItem_t*>(InterlockedExchangePointer((volatile PVOID*)&m_readPointer, NULL));
        return result;
    }
    void endRead(DataItem_t *dataItem)
    {
        if (NULL != dataItem)
        {
            InterlockedPushEntrySList(&m_writePointers, &dataItem->listNode);
        }
    }
    DataItem_t *beginWrite()
    {
        DataItem_t *result = reinterpret_cast<DataItem_t*>(InterlockedPopEntrySList(&m_writePointers));
        return result;
    }
    void endWrite(DataItem_t *dataItem)
    {
        DataItem_t *oldReadPointer = reinterpret_cast<DataItem_t*>(InterlockedExchangePointer((volatile PVOID*)&m_readPointer, dataItem));
        if (NULL != oldReadPointer)
        {
            InterlockedPushEntrySList(&m_writePointers, &oldReadPointer->listNode);
        }
    }
};

And here is the test code for it. (For both the class above and the test code you need <windows.h>, <cstdint>, and <assert.h>.)

CDoubleBuffer doubleBuffer;

DataItem_t *readValue;
DataItem_t *writeValue;

// nothing to read yet. Make sure NULL is returned.
assert(NULL == doubleBuffer.beginRead());
doubleBuffer.endRead(NULL); // we got nothing, we return nothing.

// First write without read
writeValue = doubleBuffer.beginWrite();
assert(NULL != writeValue); // if we get NULL here it is a bug.
writeValue->length = 0;
doubleBuffer.endWrite(writeValue);

// Second write without read
writeValue = doubleBuffer.beginWrite();
assert(NULL != writeValue); // if we get NULL here it is a bug.
writeValue->length = 1;
doubleBuffer.endWrite(writeValue);

// Third write without read - works because it reuses the old buffer for the new write.
writeValue = doubleBuffer.beginWrite();
assert(NULL != writeValue); // if we get NULL here it is a bug.
writeValue->length = 2;
doubleBuffer.endWrite(writeValue);

readValue = doubleBuffer.beginRead();
assert(NULL != readValue); // NULL would obviously be a terrible bug.
assert(2 == readValue->length); // We got the latest and greatest?
doubleBuffer.endRead(readValue);

readValue = doubleBuffer.beginRead();
assert(NULL == readValue); // We expect NULL here. Re-reading is not a feature of this implementation!
doubleBuffer.endRead(readValue);
BitTickler