
I am curious to know whether allocating memory using the default new operator is a non-blocking operation.

e.g.

struct Node {
    int a,b;
};

...

Node *foo = new Node();

If multiple threads tried to create a new Node and if one of them was suspended by the OS in the middle of allocation, would it block other threads from making progress?

The reason I ask is that I had a concurrent data structure that created new nodes. I then modified the algorithm to recycle the nodes. The throughput performance of the two algorithms was virtually identical on a 24-core machine. However, I then created an interference program that ran on all the system cores in order to cause as much OS pre-emption as possible. The throughput performance of the algorithm that created new nodes decreased by a factor of 5 relative to the algorithm that recycled nodes.

I'm curious to know why this would occur.

Thanks.

*Edit: pointing me to the code for the C++ memory allocator for Linux would be helpful as well. I tried looking before posting this question, but had trouble finding it.

Mark
  • Interesting question. "Non-blocking" is not the right word, though, I think. The thread that asks for memory is of course blocked until it gets the memory. What you are asking is if other threads would also be blocked in their memory allocations (my guess is yes, since heap memory is a shared resource). Don't have a good term for that, maybe "memory allocation concurrency". – Thilo Jan 05 '11 at 02:08
  • "non-blocking" is the right terminology. Concurrent algorithms fall into the classes of either locking, lock-free, non-blocking, or wait-free. Locking algorithms are obvious; however, there are subtle distinctions between the last three classes. – Mark Jan 05 '11 at 06:18
  • It all depends. Some systems have different versions of the standard library that are linked with the executable if threading is enabled. – Martin York Jan 05 '11 at 06:36
  • interesting question, since calls to `mmap` and other system calls to acquire memory can take quite some time. There might not be a single answer though, I imagine that some implementations may not block when one thread reuse memory while the other performs a `mmap` but do block if both must perform a system call, etc... – Matthieu M. Jan 05 '11 at 08:20

5 Answers

9

Seems to me that if your interference app were using new/delete (malloc/free), then it would interfere with the non-recycle test more. But I don't know how your interference test is implemented.

Depending on how you recycle (i.e., if you use pthread mutexes, God forbid), your recycle code could be slow (GCC atomic ops would be 40x faster at implementing recycle).

Malloc, in some variation, has been aware of threads for a long time on at least some platforms. Use the compiler switches on GCC to be sure you get it. Newer algorithms maintain pools of small memory chunks for each thread, so there is little or no blocking if your thread has the small item available. I have oversimplified this, and it depends on which malloc your system uses.

Plus, if you go and allocate millions of items to do a test... well, then you won't see that effect, because the small-item pools are limited in size. Or maybe you will. I don't know. If you freed each item right after allocating it, you would be more likely to see it: freed small items go back into the small-item lists rather than the shared heap. However, "what happens when thread B frees an item allocated by thread A" is a problem that may or may not be dealt with on your version of malloc, and may not be dealt with in a non-blocking manner. For sure, if you didn't immediately free during a large test, then the thread would have to refill its small-item list many times. That can block if more than one thread tries. Finally, at some point your process's heap will ask the system for memory, which can obviously block.

So, are you using small memory items? For your malloc I don't know what would count as small, but anything under 1 KB is for sure small. Are you allocating and freeing one after the other, or allocating thousands of nodes and then freeing thousands of nodes? Was your interference app allocating? All of these things will affect the results.
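
To make those two patterns concrete, here is a hypothetical micro-test (my sketch, not code from this answer; Node is the struct from the question):

#include <vector>

// Pattern A: allocate and free back to back. Items likely bounce in and
// out of the per-thread small-item list, so there is little contention.
void patternA(int n) {
  for (int i = 0; i < n; ++i) {
    Node *p = new Node();
    delete p;
  }
}

// Pattern B: allocate thousands, then free thousands. The thread has to
// refill its small-item list from the shared heap repeatedly, which is
// where blocking can show up.
void patternB(int n) {
  std::vector<Node*> v;
  for (int i = 0; i < n; ++i)
    v.push_back(new Node());
  for (size_t i = 0; i < v.size(); ++i)
    delete v[i];
}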

How to recycle with atomic ops (CAS = compare and swap):

First add a pNextFreeNode pointer to your node object. I used void*; you can use your type. This code is for 32-bit pointers, but works for 64-bit as well. Then make a global recycle pile.

void *_pRecycleHead; // global head of recycle list. 
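
For concreteness, the node itself might look like this (a sketch combining the question's Node with the added field; not code from the original answer):

struct Node {
  int a, b;
  void *pNextFreeNode; // links this node into the recycle pile while it is free
};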

Add to recycle pile:

void *Old;
while (1) { // concurrency loop
  Old = _pRecycleHead;  // copy the state of the world. We operate on the copy
  pFreedNode->pNextFreeNode = Old; // chain the new node to the current head of recycled items
  if (CAS(&_pRecycleHead, Old, pFreedNode))  // switch head of recycled items to new node
    break; // success
}

Remove from pile:

void *Old;
while ((Old = _pRecycleHead) != NULL) { // concurrency loop; only look for recycled items if the head isn't null
  if (CAS(&_pRecycleHead, Old, ((Node*)Old)->pNextFreeNode))  // switch head to head->next
    break; // success
}
pNodeYouCanUseNow = (Node*)Old;

Using CAS means the operation will succeed only if the item you are changing still holds the Old value you pass in. If there is a race and another thread got there first, the old value will be different. In real life this race happens very, very rarely. CAS is only slightly slower than actually setting a value, so compared to mutexes... it rocks.
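
The CAS above is never defined in the answer. On GCC (which the question mentions), one plausible definition is a thin wrapper over the __sync builtin -- an assumption of mine, not the answer's code:

// Hypothetical CAS wrapper using GCC's builtin (available since GCC 4.1).
// Returns nonzero if *p still held Old and was atomically replaced by New.
static inline int CAS(void **p, void *Old, void *New) {
  return __sync_bool_compare_and_swap(p, Old, New);
}

For the versioned union used below, the same builtin can be called directly on the unsigned long long member.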

The remove-from-pile above has a race condition if you add and remove the same item rapidly (the classic ABA problem). We solve that by adding a version # to the CAS'able data. If you update the version # at the same time as the pointer to the head of the recycle pile, you win. Use a union; it costs nothing extra to CAS 64 bits.

union TRecycle {
  struct {
    int iVersion;
    void *pRecycleHead;
  };  // we can set these. Note: the struct is anonymous; you may have to name it for strict ANSI
  unsigned long long n64;  // we CAS this
};

Note: you will have to go to a 128-bit struct for a 64-bit OS. So the global recycle pile looks like this now:

TRecycle _RecycleHead;
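
On a 64-bit OS, a sketch of what that wider version could look like (my assumption, not the answer's code; on x86-64, GCC needs -mcx16 so the 16-byte CAS compiles down to cmpxchg16b):

union TRecycle64 {
  struct {
    unsigned long long iVersion;
    void *pRecycleHead;
  };
  unsigned __int128 n128;  // CAS this whole thing with __sync_bool_compare_and_swap
};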

Add to recycle pile:

while (1) { // concurrency loop
  TRecycle New,Old;
  Old.n64 = _RecycleHead.n64;  // copy state
  New.n64 = Old.n64;  // new state starts as a copy
  pFreedNode->pNextFreeNode = Old.pRecycleHead;  // link item to be recycled into recycle pile
  New.pRecycleHead = pFreedNode;  // make the new state
  New.iVersion++;  // adding item to list increments the version.
  if (CAS(&_RecycleHead.n64, Old.n64, New.n64))  // now if version changed...we fail
    break; // success
}

Remove from pile:

TRecycle New, Old;
while (1) { // concurrency loop
  Old.n64 = _RecycleHead.n64;  // copy state
  if (Old.pRecycleHead == NULL)  // pile is empty; nothing to take
    break;
  New.n64 = Old.n64;  // new state starts as a copy
  New.pRecycleHead = ((Node*)Old.pRecycleHead)->pNextFreeNode;  // new state skips over the first item in the recycle list so we can have that item
  New.iVersion++;  // taking an item off the list increments the version
  if (CAS(&_RecycleHead.n64, Old.n64, New.n64))  // we fail if the version changed
    break; // success
}
pNodeYouCanUseNow = (Node*)Old.pRecycleHead;

I bet if you recycle this way you will see a perf increase.

johnnycrash
4

In multithreaded systems, malloc() and free() (and new / delete) do typically use synchronisation primitives to make them safe to call from multiple threads.

This synchronisation does also affect the performance of some applications, particularly applications that do a lot of allocation and deallocation in highly parallel environments. More efficient multithreaded memory allocators are an active field of research - see jemalloc and tcmalloc for two well-known ones.
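
Both can be dropped in without code changes via the dynamic linker, e.g. (the library path is an assumption; it varies by distribution):

$ LD_PRELOAD=/usr/lib/libtcmalloc.so ./your_app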

caf
  • thanks for jemalloc, didn't know about it :) Given that both jemalloc and tcmalloc use thread-local caching, I would surmise that they are non-blocking. – Matthieu M. Jan 05 '11 at 08:11
  • @Matthieu M.: In the fast path, yes. There would still be slow paths that are triggered sometimes that use locking. You can't really get away from that, because the allocator needs to be able to handle corner cases, like a large volume of allocations in thread A that are freed by thread B. – caf Jan 06 '11 at 02:23
3

This is really pretty much the same as this question.

Basically, malloc isn't defined by the C standard to be thread-safe, but implementors are free to add synchronisation to make it so. From your description, it sounds like your particular version is.

To be sure, in the words of Obi-Wan, "Use the Source, Luke." The malloc source will be around and it's generally pretty straightforward to read.

@Mark, you can get the standard GNU libc source with:

$ git clone git://sourceware.org/git/glibc.git
$ cd glibc
$ git checkout --track -b glibc-2_11-branch origin/release/2.11/master

See also here. Remember that malloc is in manual section 3 -- it's a library function, so it won't be in your kernel sources. You might, however, need to read down into brk, sbrk, getrlimit, setrlimit and the like to find out what the kernel does.
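
As a quick illustration of that library/kernel boundary (my own sketch, not from this answer): you can watch malloc grow the heap through the kernel by checking the program break with sbrk(0):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);           /* current program break */
    for (int i = 0; i < 100000; ++i)  /* many small allocations, leaked on  */
        malloc(16);                   /* purpose, to force the heap to grow */
    void *after = sbrk(0);
    printf("program break moved by %ld bytes\n",
           (long)((char *)after - (char *)before));
    return 0;
}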

One more link: the GCC project.

Okay, one more (I can stop any time): here's a page from which you can download the sources. Untar the file and you should find it at ./malloc/malloc.c.

Charlie Martin
  • Sorry, but the question wasn't if malloc was thread-safe. For there to be any concurrent programming, there has to be some sort of thread-safe memory allocation algorithm. What I want to know is if the memory allocator in Linux is a non-blocking algorithm (this is different from being thread-safe, or lock-free). – Mark Jan 05 '11 at 06:58
  • Don't be silly. Mark, "thread safe" and "concurrent" require the same property -- that atomicity is preserved throughout the critical section of the operation, whether the thread of control is being handled by a lightweight method (what we usually call a "thread"), a heavyweight context switch in a multiprogramming model, or by actual parallel computations in shared-memory multiprocessing. When you ask if it's "non-blocking" you're just asking if the critical section is handled in a way that allows multiple threads to proceed. – Charlie Martin Jan 05 '11 at 15:20
  • In any case, the best *answer* is to read the source. The source knows all the answers. – Charlie Martin Jan 05 '11 at 15:21
  • I think we're using different interpretations of "non-blocking". A non-blocking algorithm cannot contain a critical section that can be interrupted which would prevent progress of other threads. See "Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors" - http://www.cs.rochester.edu/~scott/papers/1998_JPDC_nonblocking.pdf I would really like to see the source, but have been having trouble tracking it down. – Mark Jan 05 '11 at 22:51
  • @Mark, which version of linux and compiler are you using? – Charlie Martin Jan 06 '11 at 02:30
  • @Charlie I've been using both Ubuntu 2.6.28 with gcc 4.3.3 and Gentoo 2.6.29 with gcc 4.3.4; results are the same on both systems. – Mark Jan 06 '11 at 02:57
  • Okay, so go read the code I pointed out -- libgcc 2.9 should be at least close, but you can certainly find the exact version by getting the gcc source for 4.3.3. The 2.6.28 kernel is easily accessible, see eg LXR: http://lxr.linux.no/linux+v2.6.28/, but for malloc you want the libc. – Charlie Martin Jan 06 '11 at 03:59
2

This question has a number of good responses: In multithreaded C/C++, does malloc/new lock the heap when allocating memory.

The consensus there is that there is locking. So a big allocation or one that requires some swapping could block a smaller allocation in another thread, even if the smaller one could finish if not for the larger allocation in progress.

gcc's new is thread-safe, if you compile with pthreads support, but that's not really what you're asking.

I know that in Windows you can create your own heap, which could be used to set up memory at the beginning of your program. I'm unaware of any Linux/Unix calls to do similar things.
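
For reference, a minimal Win32 sketch of that idea (my addition; HeapCreate/HeapAlloc are the documented calls):

#include <windows.h>

int main(void) {
    /* Create a private, growable heap at program start. */
    HANDLE hHeap = HeapCreate(0, 1024 * 1024 /* initial size */, 0 /* growable */);
    if (hHeap == NULL)
        return 1;

    void *p = HeapAlloc(hHeap, 0, 64);  /* allocate from the private heap */
    HeapFree(hHeap, 0, p);
    HeapDestroy(hHeap);
    return 0;
}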

Paul Rubel
  • Out of interest [Leap Heap](http://www.leapheap.com/) is a non-blocking custom heap for Windows. They have great information on the website on its internals - interesting read. – Tim Lloyd Jan 05 '11 at 02:29
  • @Chibacity, Thanks for sharing that link. – Vikram.exe Jan 05 '11 at 07:26
0

Short answer: No.

One thread can be in the middle of new Node(), and another thread can also go do new Node(). The first thread can be suspended, and the second might finish first. It's fine. (Assuming nothing in your constructor uses a mutex.)

Longer answer: Multithreading is a jungle. Thread-unsafe code might work fine for a decade, and then fail every day for a week. Race conditions might not trigger any trouble on your machine, but blow up on a customer's machine. Multi-threaded apps introduce a level of uncertainty, which takes extra effort to write and understand.

So, why would these two programs run nearly identically one day, and massively differently with CPU contention? I don't know. new doesn't block other threads from doing new, so it's not that. I suspect that with the extra overhead of new/delete, the OS has more opportunity to preempt your program (and maybe even more likelihood of doing so). Thus, when there's no interference, the two programs get the CPU as much as they want and run fine - but when the CPU is a scarce resource, the new/delete program gets bumped more often than the recycling one. See? It pays to recycle ;-)

Tim
  • Perhaps this is because `malloc` and `free` require a context switch into the kernel, and that's a perfect time for the kernel to preempt. Otherwise it has to preempt the process in userland, and I don't think it likes to do that as much. – cdhowie Jan 05 '11 at 02:27
  • @cdhowie: I wouldn't be surprised if `free` did *not* always require a context switch. `malloc` has to be synchronous, while `free` can simply be postponed to reduce the number of context switches (I don't know if it is the case in Linux, though). This could explain the difference in performances – Roman L Jan 05 '11 at 02:40
  • Malloc requires a context switch to the kernel? On what planet is that? – bmargulies Jan 05 '11 at 02:58
  • The algorithm which is calling the new operator is a non-blocking algorithm itself, so it doesn't have anything to do with extending a critical section of the code, as there is no critical section. – Mark Jan 05 '11 at 06:14
  • @bmargulis, it *may* require a context switch; it has to make a kernel call if it needs new virtual memory. – Charlie Martin Jan 07 '11 at 21:59