
How can I find details of the Windows C++ memory allocator that I am using?

Debugging my C++ application is showing the following in the call stack:

ntdll.dll!RtlEnterCriticalSection()  - 0x4b75 bytes 
ntdll.dll!RtlpAllocateHeap()  - 0x2f860 bytes   
ntdll.dll!RtlAllocateHeap()  + 0x178 bytes  
ntdll.dll!RtlpAllocateUserBlock()  + 0x56c2 bytes   
ntdll.dll!RtlpLowFragHeapAllocFromContext()  - 0x2ec64 bytes    
ntdll.dll!RtlAllocateHeap()  + 0xe8 bytes   
msvcr100.dll!malloc()  + 0x5b bytes 
msvcr100.dll!operator new()  + 0x1f bytes   

My multithreaded code is scaling very poorly, and profiling through random sampling indicates that malloc is currently a bottleneck in my multithreaded code. The stack seems to indicate some locking going on during memory allocation. How can I find details of this particular malloc implementation?

I've read that the Windows 7 system allocator's performance is now competitive with allocators like tcmalloc and jemalloc. I am running on Windows 7 and I'm building with Visual Studio 2010. Is msvcr100.dll the fast/scalable "Windows 7 system allocator" often referenced as "State of the Art"?

On Linux, I've seen dramatic performance gains in multithreaded code by changing the allocator, but I've never experimented with this on Windows -- thanks.

JDiMatteo
  • *"I suspect malloc is currently a bottleneck"* - why? – Roger Rowland Nov 06 '14 at 20:15
  • @RogerRowland: I profiled the code through random sampling and found malloc is a bottleneck. – JDiMatteo Nov 06 '14 at 20:20
  • 1
    Ok, but is it a bottleneck because it's "slow" or because you're calling it too often? I'm not surprised by anything you've shown - locking is to be expected in a multithreaded situation. – Roger Rowland Nov 06 '14 at 20:21
  • @RogerRowland: some allocators have better locking mechanisms than others to allow better scalability. For example tcmalloc has thread specific memory caches to allow multiple threads to allocate memory concurrently with no locks in many cases. The code I'm profiling is probably calling malloc too much and it would be better to have no dynamic memory allocation in the critical loop, but that isn't practical with large legacy code bases. – JDiMatteo Nov 06 '14 at 20:24
  • Yes, I understand your point, it's just that this smells a bit like the XY problem. You seem to have identified a cause without properly explaining how the problem has arisen. What you see in the call stack is the MSVC runtime doing what it naturally does - it may not be as fast as you like but replacing it with something else without knowing the context could leave you with more problems. – Roger Rowland Nov 06 '14 at 20:29
  • @RogerRowland: I agree with your point, but perhaps you assume I understand more than I do. I am not asking how to replace my malloc implementation nor whether it would be a good idea to do so -- I am simply asking what malloc implementation I am using with maybe a link to some details about my particular version of this implementation. All I know at this point is that "msvcr" probably stands for "Microsoft Visual C Runtime" with version 100 probably corresponding to Visual Studio 2010. Is there really no public information beyond that? – JDiMatteo Nov 06 '14 at 20:40
  • 2
    There's nothing that I know of, but then I haven't looked ;-) However, [this related question](http://stackoverflow.com/q/858592/2065121) is worth a read for some more opinions. My gut feeling is that you should forget about the low level details until you've exhausted the high level design issues as far as is possible (I understand the problem with legacy code though). – Roger Rowland Nov 06 '14 at 20:44

1 Answer


> I am simply asking what malloc implementation I am using with maybe a link to some details about my particular version of this implementation.

The call stack you are seeing indicates that the MSVCRT (more precisely, its default operator new => malloc) is calling into the Win32 Heap functions. (I do not know whether malloc routes all requests directly to the CRT's Win32 heap, or whether it does some additional caching - but if you have Visual Studio, you also have the CRT source code, so you should be able to check that.) (The Windows Internals book also talks about the Heap.)
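
If you want to see for yourself which Win32 heap the CRT's malloc ends up on, and whether the Low Fragmentation Heap is active on it (the RtlpLowFragHeapAllocFromContext frame in your stack suggests it is), something along these lines should do as a quick diagnostic. This is only a sketch, using _get_heap_handle() from <malloc.h> and HeapQueryInformation, not anything authoritative about the allocator internals:

```cpp
#include <windows.h>
#include <malloc.h>   // _get_heap_handle
#include <cstdio>

int main()
{
    // Handle of the Win32 heap that the VC++ CRT uses for malloc / operator new.
    HANDLE crtHeap = reinterpret_cast<HANDLE>(_get_heap_handle());

    // 0 = standard heap, 1 = look-aside lists, 2 = Low Fragmentation Heap
    ULONG mode = 0;
    if (HeapQueryInformation(crtHeap, HeapCompatibilityInformation,
                             &mode, sizeof(mode), NULL))
    {
        std::printf("CRT heap: %p, compatibility mode: %lu%s\n",
                    crtHeap, mode, mode == 2 ? " (LFH enabled)" : "");
    }

    std::printf("Process default heap: %p\n", GetProcessHeap());
    return 0;
}
```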


General advice I can give is that in my experience (VS 2005, but judging from Hans' answer on the other question VS2010 may be similar) the multithreaded performance of the CRT heap can cause noticeable problems, even if you're not doing insane amounts of allocations.

That RtlEnterCriticalSection is just that, a Win32 Critical Section: cheap to lock under low contention, but with higher contention you will see suboptimal runtime behaviour. (Bah! Ever tried to profile / optimize code that coughs on synchronization performance? It's a mess.)

One solution is to split the heaps: using different Heaps has given us significant improvements, even though each heap is still MT-enabled (no HEAP_NO_SERIALIZE).
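
To illustrate what I mean by splitting the heaps, a minimal sketch (the names are made up for illustration): a subsystem gets its own growable heap, created without HEAP_NO_SERIALIZE, so it stays thread-safe but no longer contends on the CRT heap's critical section.

```cpp
#include <windows.h>

// Hypothetical subsystem-private heap: still serialized (thread-safe),
// but independent of the CRT heap's critical section.
static HANDLE g_subsystemHeap = HeapCreate(0 /* no HEAP_NO_SERIALIZE */,
                                           0 /* default initial size */,
                                           0 /* growable */);

void* SubsystemAlloc(size_t bytes)
{
    return HeapAlloc(g_subsystemHeap, 0, bytes);
}

void SubsystemFree(void* p)
{
    if (p)
        HeapFree(g_subsystemHeap, 0, p);
}
```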

Since you're "coming in" via operator new, you might be able to use different allocators for some of the different classes that are allocated often. Or maybe some of your containers could benefit from custom allocators (that then use a separate heap).
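
For the per-class route, a sketch of how a frequently allocated class could overload class-level operator new/delete to use its own heap (again, Node and g_nodeHeap are just illustrative names; the heap is created during static initialization, i.e. before worker threads start):

```cpp
#include <windows.h>
#include <new>      // std::bad_alloc

// Heap dedicated to Node objects; still serialized (no HEAP_NO_SERIALIZE),
// just independent of the CRT heap.
static HANDLE g_nodeHeap = HeapCreate(0, 0, 0);

class Node
{
public:
    static void* operator new(size_t bytes)
    {
        if (void* p = HeapAlloc(g_nodeHeap, 0, bytes))
            return p;
        throw std::bad_alloc();
    }

    static void operator delete(void* p)
    {
        if (p)
            HeapFree(g_nodeHeap, 0, p);
    }

    // ... data members ...
};
```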

One case we had was that we were using libxml2 for XML parsing, and while building up the DOM tree it simply swamped the system with malloc calls. Luckily, it uses its own set of memory allocation routines that can easily be replaced by a thin wrapper over the Win32 Heap functions. This gave us huge improvements, as XML parsing no longer interfered with the rest of the system's allocations.
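
From memory, the wrapper looked roughly like this; treat it as a sketch and double-check the xmlMemSetup documentation, in particular that it has to be called before any other libxml2 call:

```cpp
#include <windows.h>
#include <cstring>
#include <libxml/xmlmemory.h>   // xmlMemSetup

// Private heap for everything libxml2 allocates while parsing.
static HANDLE g_xmlHeap = HeapCreate(0, 0, 0);

static void  XmlFree(void* mem)      { if (mem) HeapFree(g_xmlHeap, 0, mem); }
static void* XmlMalloc(size_t size)  { return HeapAlloc(g_xmlHeap, 0, size); }

static void* XmlRealloc(void* mem, size_t size)
{
    return mem ? HeapReAlloc(g_xmlHeap, 0, mem, size)
               : HeapAlloc(g_xmlHeap, 0, size);
}

static char* XmlStrdup(const char* str)
{
    size_t len  = std::strlen(str) + 1;
    char*  copy = static_cast<char*>(HeapAlloc(g_xmlHeap, 0, len));
    if (copy)
        std::memcpy(copy, str, len);
    return copy;
}

void InstallXmlAllocator()
{
    // Must run before any other libxml2 function (e.g. before xmlInitParser).
    xmlMemSetup(XmlFree, XmlMalloc, XmlRealloc, XmlStrdup);
}
```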

Martin Ba