
I'm not sure if this has been asked before, so I'll give it a try.

I have code for loading a large client list (200k clients). Every client is stored in a (currently) fixed-size struct that contains their name, address and phone number, as follows:

struct client {
    char name[80];
    char address[80];
    char phonenumber[80];
};

As you can see, the size of this struct is 240 bytes, so 200k clients would take 48 MB of memory. Obvious advantages of such a structure are the ease of management and of creating a "free list" for recycling clients. However, if tomorrow I needed to load 5M clients, this would grow to 1.2 GB of RAM.

Now, obviously, in most cases the client's name, address and phone number take much less than 80 bytes, so instead of the above structure I thought of using one like the following:

struct client {
    char *name;
    char *address;
    char *phonenumber;
};

And then have name, address and phonenumber point to dynamically allocated buffers of exactly the size needed to store each piece of information.
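A minimal sketch of that variable-length layout (dup_exact, make_client and free_client are hypothetical helper names, not from my actual code):

```cpp
#include <cstring>

struct client {
    char *name;
    char *address;
    char *phonenumber;
};

// Copy a C string into a buffer of exactly the needed size
char *dup_exact(const char *s) {
    char *p = new char[std::strlen(s) + 1];
    std::strcpy(p, s);
    return p;
}

client make_client(const char *n, const char *a, const char *ph) {
    return client{dup_exact(n), dup_exact(a), dup_exact(ph)};
}

// Every client now costs three delete[] calls to tear down
void free_client(client &c) {
    delete[] c.name;
    delete[] c.address;
    delete[] c.phonenumber;
}
```

Deleting 500k clients and loading 350k new ones with this layout means roughly 1.5M delete[] calls followed by about a million new[] calls, which is exactly the churn I'm worried about.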

I do suspect, however, that as more clients are loaded this way, it would greatly increase the number of new[] and delete[] allocations needed. My question is whether this can hurt performance at some point, for example if I suddenly want to delete 500k of the 1M clients and replace them with 350k different clients.

I also wonder: after I have allocated 1M "variable length" small buffers, if I delete many of them and then make new allocations that should recycle the freed ones, won't it cause some overhead for the allocator to find them?

DevSolar
Miki Berkovich
  • I can only speak for myself; my limit is 0. Suggested reading: https://stackoverflow.com/questions/6500313/why-should-c-programmers-minimize-use-of-new – 463035818_is_not_a_number Jan 16 '20 at 15:03
  • XY problem. Why is your (C++) struct constructed of (C) arrays / char pointers and not `std::string`? Why are you reading all your clients into memory, instead of using a database backend when your number of clients is that big? Could it be that they [taught you C first and you got stuck half-way?](https://www.youtube.com/watch?v=YnWhqhNdYyk) (No offense intended, many people suffered from this.) – DevSolar Jan 16 '20 at 15:03
  • [Pro Tip] **Don't** use `new`/`delete`. If you need a string, use `std::string`. If you need a dynamic array, use `std::vector`. Unless you are taking a class and have to do manual memory allocations, don't. They are the cause of a good percentage of bugs. – NathanOliver Jan 16 '20 at 15:03
  • FWIW, all the recommendations in the comments miss the point of your question (as `std::string`/`std::vector` generally allocate heap memory). But the answer to your question depends on your system's heap allocator, which varies by platform/system and can even be customized by you if you find that a different one performs better. Research heap fragmentation and different allocators. – Max Langhof Jan 16 '20 at 15:10
  • Though I should note that `std::string` allows for small string optimizations, which generally means that (on 64 bit platforms) no heap allocation happens when less than 16 characters are used. Definitely appealing for e.g. phone number. – Max Langhof Jan 16 '20 at 15:12
  • @MaxLanghof: Less than 21 characters, if the implementation is any good. And if OP *really* needs to load all those clients into memory, using a different allocator to `std::string` would also be of help. More so than going for C strings, anyway. – DevSolar Jan 16 '20 at 15:13
  • ***is there a limit to how many new[] & delete[] allocations are allowed before program becomes inefficient?*** No, you probably need to profile. Also make sure you do this in Release / optimized mode because some compilers add a lot of overhead (time and extra space) for debug mode allocations / deallocations. – drescherjm Jan 16 '20 at 15:15
  • You could even write `using telephone_t = std::string`. If at a later point profiling would show that you've got a bottleneck there, you can write an optimized class, but the `using` gets you started now. – MSalters Jan 16 '20 at 15:18
  • The question is how fast are std::string and std::vector compared to char[]? Since char[] is a direct pointer to a memory location, and in my program I do duplicate checks and I need them to work as fast as possible. I assume that std::string and std::vector are classes, so accessing information in them is not a direct pointer like char[].. correct me if I'm wrong? – Miki Berkovich Jan 16 '20 at 15:43
  • A common `std::string` implementation holds one pointer (to memory) and two counters (size, capacity). On a 64bit machine that's 24 bytes. With a bit of internal trickery, a string that would fit into 23 bytes is stored *directly* in that memory, without any dynamic memory allocation (i.e., *faster* than a `malloc`'ed `char*`). If the string is larger, memory is allocated, and the data pointer is pointing to that. A comparison between two `std::string` is just as fast as a `strncmp` on C strings. – DevSolar Jan 16 '20 at 15:56
  • There is no limit. However, with each allocation and deallocation, memory becomes more fragmented. The degree of fragmentation will be higher with small memory capacities than with larger capacities. In embedded systems that can't be easily rebooted, dynamic memory is minimized by using static, automatic or global arrays. Data that is not changed is placed into Read-Only segments (by declaring as `static const`). This may free up dynamic memory or stack memory depending on the architecture of the embedded system. – Thomas Matthews Jan 16 '20 at 17:29
  • Back in *ancient times*, when memory was limited (small), data was stored outside of the program (e.g. tapes or hard drives) and only pieces (chunks) were loaded and operated on as necessary. You may want to use this concept or the concept of *virtual memory*. Many operating systems have API to manage (allocate, deallocate) virtual memory. Some OS's may also have APIs to memory map files (treat files as memory). – Thomas Matthews Jan 16 '20 at 17:33

2 Answers


The answer is that there is some overhead (both in per-allocation CPU cycles and in per-allocation book-keeping memory) to making many small dynamic allocations and deallocations. How much overhead depends a lot on how your runtime's memory heap is implemented; however, most modern/popular runtimes have heap implementations that have been optimized to be quite efficient. There are articles about how various OSes' heaps are implemented that you can read to get an idea of how they work.

In a modern heap implementation, your program probably won't "hit the wall" and grind to a halt when there are "too many" heap allocations (unless your computer actually runs out of physical RAM, of course), but it will use up proportionally more RAM and CPU cycles than a comparable program that doesn't require so many.

Given that, using a zillion tiny memory allocations is probably not the best way to go. In addition to being less than optimally efficient (since every one of those tiny allocations will require a separate block of book-keeping bytes to keep track of), lots of tiny allocations can lead to memory fragmentation problems (which are less of an issue on modern 64-bit systems with virtual memory, but still something to consider), as well as being difficult to manage correctly (it's easy to end up with memory leaks or double-frees if you are doing your allocations manually).

As others have suggested in the comments, calling new and delete explicitly is discouraged in C++; it's almost always better to use higher-level data structures (e.g. std::string, std::map, std::vector, etc, or even a proper database layer instead), since by doing it that way a lot of the difficult design work will have been done for you, saving you the pain of having to re-discover and re-solve all of the problems that others have already dealt with in the past. For example, std::string already implements the short-string-optimization that allows strings shorter than a certain number of bytes to be stored without requiring a separate heap allocation; similar to the tradeoff you are trying to make in your own designs, except you get that optimization "for free", when appropriate, simply by using std::string to store your string-data.

Jeremy Friesner
  • Thanks for the answer and the attached links in it. The reason I use char[] is for speed as one of the things I do in my program is duplicate checks, and they are quite fast (using a hashtable, I run 20k and more duplicate checks against 1M clients in a second). How fast are std::string and std::vector for this kind of a requirement? – Miki Berkovich Jan 16 '20 at 15:38
  • @MikiBerkovich: So you hamstrung yourself into using C `char*` strings *believing* they are faster than C++ `std::string`, and now want profiling info on whether they actually are? ;-) As I said in another comment already, if the lots of small allocations are bothering you, you could drop-in a custom allocator (that does one big allocation and then parcels out that memory to individual `std::string` instances) and check how much faster that works. – DevSolar Jan 16 '20 at 15:45
  • @DevSolar I'm just examining a few alternatives for my "fixed-size" approach (which works quite fast but not so memory efficient). That's why I posted this question. – Miki Berkovich Jan 16 '20 at 15:58
  • @MikiBerkovich: What happens to your fixed-size approach when a name or address doesn't fit into 80 characters? Because quite a few do. Also, how do you handle non-ASCII-7? If you're working on UTF-8, have you normalized your input? (There is more than one way to encode a given glyph, e.g. Ü could be UPPERCASE U WITH DIARAESIS or UPPERCASE U followed by COMBINING DIARAESIS. They would compare non-equal on a simple `strcmp`.) – DevSolar Jan 16 '20 at 15:59
  • @DevSolar for the moment my implementation is only for english texts. So I chose 80 bytes because that's way more than an average string on any of those fields actually takes, and if it happens that some text is longer than 80 bytes I just trim it. – Miki Berkovich Jan 16 '20 at 16:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206078/discussion-between-devsolar-and-miki-berkovich). – DevSolar Jan 16 '20 at 16:02
  • @MikiBerkovich for duplicate-checks you're going to get much better efficiency by using a keyed data-structure (such as `std::map` or `std::unordered_map`) that can do O(1) or O(logN) lookups, than by trying to make an O(N) exhaustive data-scan more efficient. – Jeremy Friesner Jan 16 '20 at 16:45

is there a limit to how many new[] & delete[] allocations are allowed before program becomes inefficient?

Even a single allocation will make the program less time efficient compared to a program that doesn't do that allocation, assuming that allocation isn't needed. The inefficiency scales (at least) linearly with the number of allocations (depending on the implementation of the allocation function).

There is no objective limit for when a program becomes inefficient. If you're writing a program with a hard real-time requirement, then you have a limit for when your program is too slow, but for other programs (which is most programs) there is no such objective limit either. Generally, if your program takes too long to execute, it can be perceived as inefficient by the user, and "too long" is subjective to whoever is using the program.

A better solution than what you suggest is to use std::string members. Its size may be a few multiples of the pointer size (typically 3-4 pointers, depending on the implementation), but (assuming a decent implementation) it does magic and avoids dynamic allocation when the string fits within that space. This saves a ton of time compared to a separate allocation for each field, and a ton of space compared to the in-place 80-byte arrays. Even more importantly, it doesn't require error-prone manual memory management.

The optimally memory-efficient way to store your list of clients is a single massive array of char in which each string is stored consecutively. You can use a pointer to a string to mark the beginning of a client. If you don't want to do a linear search for a specific member, then you can use a pointer-based struct like in your question, but point into this single array instead of making separate allocations.

eerorika