Windows malloc replacement (e.g., tcmalloc) and dynamic crt linking

Question

A C++ program that uses several DLLs and QT should be equipped with a malloc replacement (like tcmalloc) for performance problems that can be verified to be caused by Windows malloc. On Linux there is no problem, but on Windows there are several approaches, and I find none of them appealing:

1. Put new malloc in lib and make sure to link it first (Other SO-question)

This has the disadvantage that, for example, strdup will still use the old malloc, so a later free through the replacement may crash the program.

2. Remove malloc from the static libcrt library with lib.exe (Chrome)

This is tested/used(?) for Chrome/Chromium, but has the disadvantage that it only works when the CRT is linked statically. Static linking has the problem that if a system library is linked dynamically against msvcrt, heap allocations and deallocations may be mismatched. If I understand it correctly, tcmalloc could be linked dynamically so that there is a common heap for all self-compiled DLLs (which is good).

3. Patch CRT source code (Firefox)

Firefox's jemalloc apparently patches the Windows CRT source code and builds a new CRT. This again has the static/dynamic linking problem above.

One could think of using this to generate a dynamic MSVCRT, but I think this is not possible, because the license forbids providing a patched MSVCRT with the same name.

4. Dynamically patching loaded CRT at run time

Some commercial memory allocators can do such magic. tcmalloc can do this too, but it seems rather ugly. It had some issues, but they have been fixed. Currently, the tcmalloc patching does not work under 64-bit Windows.

Are there better approaches? Any comments?

So which approach did you use? Which one did you use to verify the assertion that the alternate allocator worked better than the one provided with the CRT malloc? Which version of the CRT did you use and is it better/worse/the same as newer versions? — Adrian McCarthy, May 29 '12 at 20:22
Why not try replacing the global C++ new? Wouldn't that work (and match the shared libs + app main binary + MS CRT as a shared lib setup)? – mlvljr, Apr 01 '13 at 19:49

score 8 · Answer 1 · answered May 19 '09 at 12:23

Q: A C++ program that is split across several DLLs should:

A) replace malloc?

B) ensure that allocation and deallocation happen in the same DLL module?

A: The correct answer is B. A C++ application design that incorporates multiple DLLs SHOULD include a mechanism to ensure that things allocated on the heap in one DLL are freed by the same DLL module.

Why would you split a C++ program into several DLLs anyway? By C++ program I mean that the objects and types you are dealing with are C++ templates, STL objects, classes etc. You CAN'T pass C++ objects across DLL boundaries without either a lot of very careful design and lots of compiler-specific magic, or suffering from massive duplication of object code in the various DLLs, and as a result an application that is extremely version sensitive. Any small change to a class definition will force a rebuild of all exes and DLLs, removing at least one of the major benefits of a DLL approach to app development.

Either stick to a straight C interface between app and DLLs, suffer hell, or just compile the entire C++ app as one exe.
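
As a rough illustration of option B combined with a straight C interface, here is a minimal sketch. All names (`Widget`, `widget_create`, `widget_destroy`, `widget_name`, `make_widget`) are hypothetical, and in a real project the three `extern "C"` functions would be compiled into and exported from the DLL rather than living in the same file as the caller. The point is only that the module that allocates is also the module that frees.

```cpp
#include <cassert>
#include <memory>
#include <string>

// --- DLL side (hypothetical; would be exported from the DLL in a real build) ---
struct Widget {
    std::string name;  // implementation detail; callers only ever see Widget*
};

// Allocation happens inside the DLL, so it uses the DLL's CRT heap.
extern "C" Widget* widget_create(const char* name) {
    try {
        return new Widget{name ? name : ""};
    } catch (...) {
        return nullptr;  // never let a C++ exception cross the C boundary
    }
}

// Deallocation also happens inside the DLL, on the same heap.
extern "C" void widget_destroy(Widget* w) {
    delete w;
}

extern "C" const char* widget_name(const Widget* w) {
    assert(w != nullptr);
    return w->name.c_str();
}

// --- Application side: RAII wrapper that routes the free back to the DLL ---
struct WidgetDeleter {
    void operator()(Widget* w) const noexcept { widget_destroy(w); }
};
using WidgetPtr = std::unique_ptr<Widget, WidgetDeleter>;

inline WidgetPtr make_widget(const char* name) {
    return WidgetPtr(widget_create(name));
}
```

The application never calls `delete` or `free` on a `Widget` itself; `WidgetPtr` always hands the pointer back to `widget_destroy`, so it does not matter which CRT the application and the DLL were each linked against.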

Adrian McCarthy · Answer 2 · 2014-05-07T23:10:11.460

It's a bold claim that a C++ program "should be equipped with a malloc replacement (like tcmalloc) for performance problems...."

"[In] 6 out of 8 popular benchmarks ... [real-sized applications] replacing back the custom allocator, in which people had invested significant amounts of time and money, ... with the system-provided dumb allocator [yielded] better performance. ... The simplest custom allocators, tuned for very special situations, are the only ones that can provide gains." --Andrei Alexandrescu

Most system allocators are about as good as a general purpose allocator can be. You can do better only if you have a very specific allocation pattern.

Typically, such special patterns apply only to a portion of the program, in which case, it's better to apply the custom allocator to the specific portion that can benefit than it is to globally replace the allocator.

C++ provides a few ways to selectively replace the allocator. For example, you can provide an allocator to an STL container or you can override new and delete on a class by class basis. Both of these give you much better control than any hack which globally replaces the allocator.
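
To make the two selective mechanisms concrete, here is a minimal sketch under stated assumptions: `Arena`, `demo_arena`, `ArenaAllocator`, and `Particle` are hypothetical names, and the arena is a deliberately simplistic, single-threaded bump allocator that never frees individual blocks. It is not a recommendation for a particular allocator, only a demonstration of where the hooks are.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical bump allocator over a fixed buffer. Not thread-safe, and
// individual blocks are never returned; everything lives until program exit.
class Arena {
public:
    void* allocate(std::size_t bytes, std::size_t align) {
        assert(align != 0);
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start + bytes > sizeof(buffer_)) throw std::bad_alloc{};
        used_ = start + bytes;
        return buffer_ + start;
    }
    std::size_t used() const { return used_; }

private:
    alignas(std::max_align_t) unsigned char buffer_[64 * 1024];
    std::size_t used_ = 0;
};

inline Arena& demo_arena() {
    static Arena arena;
    return arena;
}

// 1) Allocator passed to an STL container: only this container is affected.
template <class T>
struct ArenaAllocator {
    using value_type = T;
    ArenaAllocator() = default;
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(demo_arena().allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // the arena does not free blocks
};
template <class T, class U>
bool operator==(const ArenaAllocator<T>&, const ArenaAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const ArenaAllocator<T>&, const ArenaAllocator<U>&) { return false; }

// 2) Class-level operator new/delete: only this class is affected.
struct Particle {
    float x, y, z;
    static void* operator new(std::size_t size) {
        return demo_arena().allocate(size, alignof(Particle));
    }
    static void operator delete(void*) noexcept {}  // the arena does not free blocks
};
```

A `std::vector<int, ArenaAllocator<int>>` and every `new Particle` then draw from the arena, while the rest of the program keeps using the default allocator.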

Note also that replacing malloc and free will not necessarily change the allocator used by operators new and delete. While the global new operator is typically implemented using malloc, there is no requirement that it do so. So replacing malloc may not even affect most of the allocations.
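
For completeness, this is roughly what replacing the global operators looks like in standard C++. It is a sketch, not a drop-in solution: the counter `g_new_calls` and the helper `replacement_is_active` are made up for illustration, and the body simply forwards to `std::malloc` where a real replacement would call the alternate allocator. On Windows the replacement applies only to the module it is linked into, so each DLL would need its own copy.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Counts calls so the effect of the replacement can be observed.
static std::atomic<std::size_t> g_new_calls{0};

// Replacement for the global scalar operator new. The default operator new[]
// forwards to this function, so array allocations are covered as well.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (size == 0) size = 1;  // operator new must return a unique non-null pointer
    if (void* p = std::malloc(size)) return p;  // a custom allocator would go here
    throw std::bad_alloc{};
}

// Matching replacements for operator delete (unsized and sized forms).
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Calls the allocation function directly and reports whether the
// replacement above was the one that ran.
bool replacement_is_active() {
    const std::size_t before = g_new_calls.load();
    void* p = ::operator new(32);
    assert(p != nullptr);
    ::operator delete(p);
    return g_new_calls.load() > before;
}
```

Note that this changes nothing for code that calls `malloc` and `free` directly, which is the mirror image of the caveat above.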

If you're using C, chances are you can wrap or replace key malloc and free calls with your custom allocator just where it matters and leave the rest of the program to use the default allocator. (If that's not the case, you might want to consider some refactoring.)

System allocators have decades of development behind them. They are stable and well-tested. They perform extremely well for general cases (in terms of raw speed, thread contention, and fragmentation). They have debugging versions for leak detection and support for tracking tools. Some even improve the security of your application by providing defenses against heap buffer overrun vulnerabilities. Chances are, the libraries you want to use have been tested only with the system allocator.

Most of the techniques to replace the system allocator forfeit these benefits. In some cases, they can even increase memory demand (because they can't be shared with the DLL runtime possibly used by other processes). They also tend to be extremely fragile in the face of changes in the compiler version, runtime version, and even OS version. Using a tweaked version of the runtime prevents your users from getting benefits of runtime updates from the OS vendor. Why give all that up when you can retain those benefits by applying a custom allocator just to the exceptional part of the program that can benefit from it?

sean e · Answer 3 · 2009-05-16T19:46:37.300

Where does your premise "A C++ program that uses several DLLs and QT should be equipped with a malloc replacement" come from?

On Windows, if all the DLLs use the shared MSVCRT, then there is no need to replace malloc. By default, Qt builds against the shared MSVCRT DLL.

One will run into problems if they:

1) mix DLLs that link the CRT statically with DLLs that use the shared VCRT

2) AND also free memory somewhere other than where it was allocated (i.e., free memory in a statically linked DLL that was allocated by the shared VCRT, or vice versa).

Note that adding your own ref-counted wrapper around a resource can help mitigate the problems associated with resources that need to be deallocated in particular ways (i.e., a wrapper that disposes of one type of resource via a callback to the originating DLL, a different wrapper for a resource that originates from another DLL, etc.).
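
One way such a ref-counted wrapper might look is a `std::shared_ptr` with a custom deleter. This is a sketch only: `plugin_make_buffer`, `plugin_free_buffer`, `plugin_free_calls`, and `wrap_plugin_buffer` are hypothetical stand-ins for functions that a real DLL would export, defined here in the same file so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// --- Stand-ins for a DLL's exported allocate/free pair (hypothetical) ---
static int g_plugin_free_calls = 0;

extern "C" char* plugin_make_buffer(std::size_t n) {
    return static_cast<char*>(std::malloc(n));  // allocated on the DLL's heap
}

extern "C" void plugin_free_buffer(char* p) {
    ++g_plugin_free_calls;
    std::free(p);  // freed on the same heap it came from
}

int plugin_free_calls() { return g_plugin_free_calls; }

// --- Caller side: reference-counted handle that calls back into the DLL ---
std::shared_ptr<char> wrap_plugin_buffer(std::size_t n) {
    assert(n > 0);
    // The deleter is stored with the reference count, so whichever copy of the
    // handle dies last still releases the buffer through the originating DLL.
    return std::shared_ptr<char>(plugin_make_buffer(n), &plugin_free_buffer);
}
```

A resource from a different DLL would get its own wrapper function with that DLL's free routine as the deleter; the handle type `std::shared_ptr<char>` stays the same.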

answered May 16 '09 at 02:51 · edited May 16 '09 at 19:46

The premise comes from large performance gains when using tcmalloc instead of the MSVCRT. – Weidenrinde May 25 '09 at 15:56
If that's the case, then a C program or even a C++ program that doesn't use Qt could benefit from the change. However, I would disagree that they "should be equipped with a malloc replacement" until profiling indicates that MSVCRT is insufficient. – sean e May 25 '09 at 16:44
1. The measured, large performance improvements were clearly related to MSVCRT's memory management. The impact depends on the allocation style of the program, thus I do not claim that MSVCRT is bad. 2. I just mentioned Qt to indicate that several sub-applications use the QT-lib and therefore static linking is not the preferred option. – Weidenrinde May 26 '09 at 13:21
In the first sentence of your question, you write that "C++ programs... should be equipped with a malloc replacement" - it only follows that you assume MSVCRT performs poorly. That sentence is written as a prescription: "C++ applications should replace malloc". That's my only issue with the premise - it is highly context dependent whether or not MSVCRT performance is insufficient. Regarding Qt, yes, static link is not preferred but again it was how Qt was presented in the problem that I took issue with. – sean e May 28 '09 at 02:35
After profiling has shown which allocations are hurting performance, the use of an additional allocator could be limited to the objects/resources that are affected by the performance issue. It may not be the case that allocation of everything, for example QObject-derived classes, needs wholesale allocator replacement. – sean e May 28 '09 at 02:38
Read carefully: I said "A program" not "C++ programs". MSVCRT has a special small-object treatment of the heap, but from what I understood it has a fixed maximum size for small objects. For the particular program this behaves badly. – Weidenrinde Jun 04 '09 at 08:04
For the particular case I found no way to isolate which allocations cause the problem. My conjecture is that the general pattern of allocation generates fragmentation and this slows down more or less everything. In the end it seems easier to replace the overall malloc procedure than to manually change the allocator in many places. – Weidenrinde Jun 04 '09 at 08:05
Well, what was actually written was "A C++ program". The way it is written led me to believe that what was written was prescriptive behavior for all C++ programs. Consider the sentence "A tired person should get some sleep" - does "a tired person" refer to one particular person or any person in general that is tired? I see now that what you meant was "I have a C++ program that uses several DLLs and QT in which for performance reasons I need to replace malloc". I only took issue because it seemed that you were espousing a premise for all C++ programs. – sean e Jun 04 '09 at 15:26
Replacing the system memory allocation strategy is often desirable, due to the performance characteristics of said system memory allocator versus alternatives (such as dlmalloc, which I have used extensively). Clearly, not requiring memory allocation and using only static pools would get around the problem, but if one is inheriting code that does a lot of memory allocation and calls malloc() and friends, performance can usually be increased by going with an alternative. The default may be sufficient for many applications, but it is by no means the performance leader. – dash-tom-bang Sep 25 '09 at 00:18

rogerdpack · Answer 4 · 2012-05-30T16:09:29.317

nedmalloc? Also, NB: smplayer uses a special patch to override malloc, which may be the direction you're headed in.

answered May 20 '09 at 12:24 · edited May 30 '12 at 16:09

Do you know how nedmalloc handles the problem? – Weidenrinde May 26 '09 at 13:12
Not sure, though I know it "doesn't automagically replace the system malloc()"; http://www.nedprod.com/programs/portable/nedmalloc/ might have more. – rogerdpack Jun 18 '09 at 15:39
In nedmalloc source code, there's a winpatcher. Looks like it's doing the (4) magic, and that's probably what you want. Haven't checked, though, if x64 windows is supported. – zerm Feb 13 '12 at 16:13
