15

I'm writing a C++14 JSON library as an exercise and to use it in my personal projects.

By using callgrind I've discovered that the current bottleneck during a continuous value creation from string stress test is an std::string dynamic memory allocation. Precisely, the bottleneck is the call to malloc(...) made from std::string::reserve.

I've read that many existing JSON libraries such as rapidjson use custom allocators to avoid malloc(...) calls during string memory allocations.

I tried to analyze rapidjson's source code but the large amount of additional code and comments, plus the fact that I'm not really sure what I'm looking for, didn't help me much.

  • How do custom allocators help in this situation?
    • Is a memory buffer preallocated somewhere (where? statically?) and std::strings take available memory from it?
  • Are strings using custom allocators "compatible" with normal strings?
    • They have different types. Do they have to be "converted"? (And does that result in a performance hit?)

Code notes:

  • Str is an alias for std::string.
Vittorio Romeo
  • @texasbruce: Sorry for not mentioning it. It's an alias for `std::string`. – Vittorio Romeo Sep 30 '14 at 22:30
  • It's a bottleneck how? You're going to end up allocating memory anyway. – Rapptz Sep 30 '14 at 22:58
  • I've personally dropped in `boost::string_ref` for `std::string` in my parser and everything was peaches. See related: http://stackoverflow.com/questions/24122557/how-to-parse-mustache-with-boost-xpressive-correctly/24131286#24131286 There's many many ways to skin a cat. – sehe Sep 30 '14 at 23:06
  • Looked at your code and seeing reinterpret_casts, why?!? `return reinterpret_cast(*this);` Did you mean static_cast or dynamic_cast? – Neil Kirk Oct 01 '14 at 00:13
  • Create a version of `readStr` that takes a reference to an existing string, so in loops you can pass in and reuse a string object declared outside of the loop. Although not as elegant as returning new objects from functions, this pattern is more efficient. – Neil Kirk Oct 01 '14 at 00:19
  • The `reinterpret_cast` that Neil mentioned is just wrong. There are no guarantees whatsoever about the result. The correct cast for known-safe downcasts is `static_cast`. As a rule of thumb, if you're not sure what the right cast is, then it is not `reinterpret_cast`. – R. Martinho Fernandes Oct 02 '14 at 12:30
  • @R.MartinhoFernandes, @Neil Kirk: how is it wrong? Is there any danger even if I'm 100% sure that the object is actually of type `TDerived` [like in this example?](http://ideone.com/XpZjj0) – Vittorio Romeo Oct 02 '14 at 12:39
  • @VittorioRomeo It's wrong because it doesn't give any guarantees whatsoever about the result. It doesn't have to point to the same object. It doesn't even have to point to the same complete object. It doesn't have to "point" at all. Moreover, your apparent assumption that the address of the derived class will be the same as the address of the base class is wrong; that isn't necessarily the case, and one does not need a far-fetched example to show that (add a virtual member to the derived class, or use multiple inheritance). `static_cast` always works, though. (ran out of comment space) – R. Martinho Fernandes Oct 02 '14 at 12:42
  • To sum up: it is wrong because it doesn't do what you want (a downcast). What you want is done by `static_cast`. – R. Martinho Fernandes Oct 02 '14 at 12:43
  • @R.MartinhoFernandes: Thank you very much for explaining the potential issues. I'm replacing every similar `reinterpret_cast` usage with a `static_cast` wrapper called `upCast` that checks const correctness and checks if the casted object type is a base class of `T` via `static_assert`. – Vittorio Romeo Oct 02 '14 at 14:34
  • Use a define which is static_cast in release mode and dynamic_cast in debug mode. – Neil Kirk Oct 03 '14 at 12:38
  • Did you try my other suggestion? There is pretty C++ and then there is fast C++. My way is not that ugly actually but much faster. The best way to avoid the cost of allocations is to avoid allocations in the first place. Any custom allocator will still carry some overhead. – Neil Kirk Oct 03 '14 at 12:40
  • @NeilKirk: Well, it would be faster but I don't think it can be applied here. The creation of a new JSON string value requires some sort of allocation. Even if I had an external string to "fill", it would have to be allocated during parsing, I suppose – Vittorio Romeo Oct 03 '14 at 12:51
  • I looked and you have some for loops. The idea is you move the string creation outside of the loop. – Neil Kirk Oct 03 '14 at 13:51
  • Writing a custom string class optimised for your use case might be easier than writing a custom allocator. – Jonathan Wakely Oct 03 '14 at 15:10

6 Answers

7

By default, std::string allocates memory as needed from the same heap as anything you allocate with malloc or new. To get a performance gain from a custom allocator, you will need to manage your own "chunk" of memory in such a way that your allocator can deal out the amounts your strings ask for faster than malloc does. Your memory manager makes relatively few calls to malloc (or new, depending on your approach) under the hood, requesting "large" blocks of memory at once, then deals out sections of these blocks through the custom allocator. To actually beat malloc, your memory manager will usually have to be tuned to the known allocation patterns of your use cases.

This kind of thing often comes down to the age-old trade-off of memory use versus execution speed. For example: if you have a known upper bound on your string sizes in practice, you can over-allocate so that every slot accommodates the largest case. While this wastes memory, it avoids the fragmentation overhead that more general allocation runs into, and it makes any calls to realloc essentially constant time for your purposes.

@sehe is exactly right. There are many ways.

EDIT:

To finally address your second question, strings using different allocators can play nicely together, and usage should be transparent.

For example:

#include <iostream>
#include <string>

class myalloc : public std::allocator<char>{};
myalloc customAllocator;

int main(void)
{
  std::string mystring(customAllocator);
  std::string regularString = "test string";
  mystring = regularString;
  std::cout << mystring;

  return 0;
}

This is a fairly silly example and, of course, uses the same workhorse code under the hood. However, it shows assignment between strings using allocator classes of "different types". Implementing a useful allocator that supplies the full interface required by the STL without just disguising the default std::allocator is not as trivial. This seems to be a decent write up covering the concepts involved. The key to why this works, in the context of your question at least, is that using different allocators doesn't cause the strings to be of different type. Notice that the custom allocator is given as an argument to the constructor not a template parameter. The STL still does fun things with templates (such as rebind and Traits) to homogenize allocator interfaces and tracking.

iwolf
  • not always true, the STL string spec makes no assumptions about where memory comes from. Many implementations (inc VC++) have a small buffer inside the class, so if you have a 32-byte or less string, it'll get allocated in that, effectively on the stack, rather than on the heap. – gbjbaanb Oct 01 '14 at 07:33
  • Where should this allocator live? Should it have a `static` lifetime? - Can `std::string` instances that use this allocator easily work with normal strings? – Vittorio Romeo Oct 01 '14 at 07:39
  • Please don't make perfectly concurrency-able functions use `static` globals. Anyone using multiple threads will want to kill you for it. Mutable globally shared state in a library is in general what I would call a dick move. – R. Martinho Fernandes Oct 02 '14 at 12:34
  • @R.MartinhoFernandes: I agree with you. But libraries like `rapidjson` do not require the user to instantiate an allocator somewhere before use. I was wondering how do they avoid that - would using `thread_local` work? – Vittorio Romeo Oct 02 '14 at 13:36
  • Btw: if you go through the trouble of writing an optimized memory allocator, you can just as well override the global one by defining the functions `void* operator new (size_t size)` and `void operator delete (void *pointer)`. Of course, if you use this approach, you must back your allocations either with `malloc()` or `mmap()` since calling `new` would recurse back into your own function. – cmaster - reinstate monica Oct 03 '14 at 07:54
  • Sorry, but your whole edit is just wrong: the allocator type IS part of the string's type (btw.: `std::string` is just a typedef for `std::basic_string<char, std::char_traits<char>, std::allocator<char>>`). The reason your code compiles is that you derive from `std::allocator<char>`, so your allocator is a valid argument for a parameter of type `std::allocator<char>`. Your string, however, doesn't use the passed allocator, but a SLICED COPY (so just a normal std::allocator). So the reason you can assign `regularString` to `mystring` is that they are using the same allocator type. – MikeMB Sep 10 '15 at 15:44
  • The stuff about rebind you are mentioning has nothing to do with type erasure, but allows a container to use that allocator to allocate memory for internal types that are not known to the user, like e.g. a node class in a linked list. – MikeMB Sep 10 '15 at 15:49
  • @MikeMB: You got me. When I wrote this, I assumed there would be polymorphism where there is none. But I guess the point is to send your compiler through angle-bracket-hell to save dynamic calls at runtime. – iwolf Oct 14 '15 at 23:45
4

What often helps is the creation of a GlobalStringTable.

See if you can find portions of the old NiMain library from the now defunct NetImmerse software stack. It contains an example implementation.

Lifetime

What is important to note is that this string table needs to be accessible across different DLL spaces, and that it is not a static object. R. Martinho Fernandes already warned that the object needs to be created when the application or DLL thread is created / attached, and disposed of when the thread is destroyed or the DLL is detached, preferably before any string object is actually used. This sounds easier than it actually is.

Memory allocation

Once you have a single point of access that exports correctly, you can have it allocate a memory buffer up-front. If the memory is not enough, you have to resize it and move the existing strings over. Strings essentially become handles to regions of memory in this buffer.

Placement new

Something that often works well is placement new, where you specify at which memory address your new string object is to be constructed. Instead of allocating, the operator simply takes the memory location passed in as an argument, zeroes the memory at that location, and returns it. You can also keep track of the allocation, the actual size of the string, etc. in the GlobalStringTable object.

SOA

Handling the actual memory scheduling is something that is up to you, but there are many possible ways to approach this. Often, the allocated space is partitioned in several regions so that you have several blocks per possible string size. A block for strings <= 4 bytes, one for <= 8 bytes, and so on. This is called a Small Object Allocator, and can be implemented for any type and buffer.

If you expect many string operations where small strings grow repeatedly, you may change your strategy and allocate larger buffers from the start, so that the number of memmove operations is reduced. Or you can opt for a different approach and use string streams for those.

String operations

It is not a bad idea to derive from std::basic_string, so that most of the operations still work but the internal storage actually lives in the GlobalStringTable; that way you can keep using the same STL conventions. This also ensures that all allocations happen within a single DLL, so that there can be no heap corruption from linking different kinds of strings between different libraries, since all the allocation operations are essentially in your DLL (and are rerouted to the GlobalStringTable object).

StarShine
3

I think you'd be best served by reading up on the EASTL

It has a section on allocators and you might find fixed_string useful.

Niall
gbjbaanb
  • Fixed string is particularly useful if you do a lot of string compares, it does not alleviate the allocation burden. – StarShine Oct 03 '14 at 12:39
3

Custom allocators can help because most malloc()/new implementations are designed for maximum flexibility, thread safety, and bullet-proof operation. For instance, they must gracefully handle the case where one thread keeps allocating memory and sending the pointers to another thread that deallocates them. Things like these are difficult to handle in a performant way and drive up the cost of malloc() calls.

However, if you know that some things cannot happen in your application (like one thread deallocating stuff another thread allocated, etc.), you can optimize your allocator further than the standard implementation. This can yield significant results, especially when you don't need thread safety.

Also, the standard implementation is not necessarily well optimized: Implementing void* operator new(size_t size) and void operator delete(void* pointer) by simply calling through to malloc() and free() gives an average performance gain of 100 CPU cycles on my machine, which proves that the default implementation is suboptimal.

cmaster - reinstate monica
  • How does your replacement `operator new` handle allocation failure? Are you sure the default implementation is suboptimal, rather than simply standard conforming? Most standard library implementations are fairly well optimised already. – Jonathan Wakely Oct 03 '14 at 15:08
  • @JonathanWakely Of course, you are free to check whether `malloc()` returned NULL and throw an exception as the standard requires. That is one comparison against zero in the normal path, which costs much less than 100 CPU cycles. So yes, I am dead certain that the standard implementation is suboptimal on my system, even though I didn't bother to check the return value: You will generally not see `malloc()` returning NULL on a modern system because kernels overcommit their memory. You'll get shot by the OOM-killer instead... – cmaster - reinstate monica Oct 03 '14 at 15:33
  • Gcc's https://gcc.gnu.org/viewcvs/gcc/trunk/libstdc%2B%2B-v3/libsupc%2B%2B/new_op.cc?view=markup#l42 should be well optimised and still fully conforming – Jonathan Wakely Oct 03 '14 at 15:49
  • From looking at the source code, I'd say the problem is not only checking for malloc returning 0, but also handling objects of size zero and the necessary error handling required by the standard, which probably influences inlining. So I tend to agree with Jonathan here - the standard just doesn't allow for further optimizations. But that is essentially what you are saying in your first two paragraphs: The fewer gurantees you have to provide, the more efficient your code becomes (one of the reasons for all those undefined behavior clauses in the standard) – MikeMB Sep 10 '15 at 17:38
  • @MikeMB This SO answer (http://stackoverflow.com/a/1087066/2445184) neatly summarizes the requirements on `operator new(size_t)` for zero size requests. Basically, if you want to implement a perfectly standard compliant version that calls through to `malloc()`, you only need to check if the size is zero (calling `malloc(1)` in that case), and check if the return value is `NULL` (throwing an exception in that case). These two checks will virtually fail every time, so it's only the check itself that's relevant to performance. And such tests will never consume 50 cycles each. – cmaster - reinstate monica Sep 10 '15 at 19:03
  • Have you even looked at gcc's implementation? That is exactly what they are doing. So obviously, it does cost 50 cycles for some reason. As I said, I can only imagine that this is somehow related to different inlining behavior, or maybe it affects other compiler optimizations. Btw. The new operator is not just supposed to throw an exception, but to call the `new_handler` if such a handler is installed. But as you said, those instructions are never executed in your typical benchmark. – MikeMB Sep 11 '15 at 04:46
2

The best way to avoid a memory allocation is not to do it!

BUT if I remember JSON correctly, all the readStr values get used either as keys or as identifiers, so you will have to allocate them eventually. std::string's move semantics should ensure that the allocated array is not copied around but reused until its final use. The default NRVO/RVO/move should reduce any copying of the data, if not of the string header itself.

Method 1:
Pass the result as a reference from the caller, which has reserved SomeReasonableLargeValue chars, then clear it at the start of readStr. This is only usable if the caller can actually reuse the string.

Method 2:
Use the stack.

// Reserve memory for the string (BOTTLENECK)
if (end - idx < SomeReasonableValue) { // 32?
  char result[SomeReasonableValue] = {0};  // feel free to use std::array if you want bounds checking, but the preceding "if" should ensure it's not a problem.
  int ridx = 0;

  for(; idx < end; ++idx) {
    // Not an escape sequence
    if(!isC('\\')) { result[ridx++] = getC(); continue; }
    // Escape sequence: skip '\'
    ++idx;
    // Convert escape sequence
    result[ridx++] = getEscapeSequence(getC());
  }

  // Skip closing '"'
  ++idx;
  result[ridx] = 0; // 0-terminated.
  // optional assert here to ensure nothing went wrong.
  return result; // the bottleneck might now move here as the data is copied to the receiving string.
}
// fallback code only if the string is long.
// Your original code here

Method 3:
If your string by default can allocate some size to fill its 32/64-byte boundary, you might want to take advantage of that; construct result like this instead, in case the constructor can optimize it.

Str result(end - idx, 0);

Method 4:
Most systems already has some optimized allocator that like specific block sizes, 16,32,64 etc.

siz = ((end - idx) & ~0xf) + 16; // if the allocator already works in chunks of 16 bytes.
Str result;
result.reserve(siz);

Method 5:
Use either Google's allocator (tcmalloc) or Facebook's (jemalloc) as a global new/delete replacement.

Surt
1

To understand how a custom allocator can help you, you need to understand what malloc and the heap do, and why they are quite slow in comparison to the stack.

The Stack

The stack is a large block of memory reserved up front for your thread's function calls. You can think of it as this

([] means a byte of memory)

[P][][][][][][][][][][][][][][][]

(P is a pointer that points to a specific byte of memory; in this case it's pointing at the first byte)

So the stack is a block with only one pointer. When you allocate memory, it performs pointer arithmetic on P, which takes constant time. So declaring int i = 0; would mean this,

P + sizeof(int).

[i][i][i][i][P][][][][][][][][][][][] (i in [] is a block of memory occupied by an integer)

This is blazing fast, and as soon as you go out of scope, the entire chunk of memory is reclaimed simply by moving P back to the first position.

The Heap

The heap hands out memory from a pool managed by the C++ runtime, which in turn requests it from the operating system. When you call malloc, the heap finds a stretch of contiguous free memory that fits your request, marks it as used so nothing else can take it, and returns it to you as a void*.

So, a theoretical heap with little optimization, when calling new(sizeof(int)), would do this.

Heap chunk

At first : [][][][][][][][][][][][][][][][][][][][][][][][][]

Allocate 4 bytes (sizeof(int)): a pointer goes through every byte of memory, finds one that is of the correct length, and returns a pointer to you. After : [i][i][i][i][][][][][][][][][][][][][][][][][][][][][]

This is not an accurate representation of the heap, but from this you can already see numerous reasons for being slow relative to the stack.

  1. The heap is required to keep track of all already-allocated memory and the respective lengths. In our test case above, the heap was empty and did not require much, but in worst-case scenarios, the heap will be populated with multiple objects with gaps in between (heap fragmentation), and this will be much slower.

  2. The heap is required to cycle through all the bytes to find one that fits your length.

  3. The heap can suffer from fragmentation since it will never completely compact itself unless you tell it to. So if you allocated an int, a char, and another int, your heap would look like this

[i][i][i][i][c][i2][i2][i2][i2]

(i stands for bytes occupied by an int and c stands for bytes occupied by a char.) When you de-allocate the char, it will look like this.

[i][i][i][i][empty][i2][i2][i2][i2]

So when you want to allocate another object into the heap,

[i][i][i][i][empty][i2][i2][i2][i2][i3][i3][i3][i3]

unless a later allocation is exactly the size of 1 char, that 1-byte gap is simply wasted. In more complex programs with millions of allocations and deallocations, the fragmentation issue becomes severe and performance degrades badly.

  4. The heap has to worry about cases like thread safety (someone else said this already).

Custom Heap/Allocator

So, a custom allocator usually needs to address these problems while providing the benefits of the heap, such as personalized memory management and object permanence.

These are usually accomplished with specialized allocators. If you know you don't need to worry about thread safety, or you know exactly how long your strings will be, or you have a predictable usage pattern, you can make your allocator faster than malloc and new by quite a lot.

For example, if your program requires a lot of allocations as fast as possible without lots of deallocations, you could implement a stack allocator, in which you allocate a huge chunk of memory with malloc at startup,

e.g.

typedef char* buffer;

// Super simple example; a real one needs alignment, bounds checks and an out-of-memory path.
struct StackAllocator {
    buffer stack;
    char* pointer;
    explicit StackAllocator(int expectedSize) {
        stack = new char[expectedSize];
        pointer = stack;
    }
    ~StackAllocator() { delete[] stack; }
    char* allocate(int size) {
        char* returnedPointer = pointer;
        pointer += size;
        return returnedPointer;
    }
    void empty() { pointer = stack; }
};

Get expected size, get a chunk of memory from the heap.

Assign a pointer to the beginning.

[P][][][][][][][][][] ..... [].

then have one pointer that moves for each allocation. When you no longer need the memory, you simply move the pointer back to the beginning of your buffer. This gives you O(1) allocations and deallocations as well as object permanence, at the cost of flexible deallocation and a large initial memory requirement.

For strings, you could try a chunk allocator. For every allocation, the allocator gives a set chunk of memory.

Compatibility

Compatibility with other strings is almost guaranteed. As long as you are handing out a contiguous chunk of memory and preventing anything else from using that block, it will work.

WWeng