268

It's known that calloc() differs from malloc() in that it initializes the memory it allocates: with calloc(), the memory is set to zero, while with malloc(), the memory is left uninitialized.

So in everyday work, I think of calloc() as malloc() followed by memset(). Incidentally, for fun, I wrote the following code as a benchmark.

The result is confusing.

Code 1:

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (1024 * 1024 * 256)

int main(void)
{
        int i = 0;
        char *buf[10];
        while (i < 10)
        {
                buf[i] = (char *)calloc(1, BLOCK_SIZE);
                i++;
        }
        return 0;
}

Output of Code 1:

time ./a.out  
**real 0m0.287s**  
user 0m0.095s  
sys 0m0.192s  

Code 2:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (1024 * 1024 * 256)

int main(void)
{
        int i = 0;
        char *buf[10];
        while (i < 10)
        {
                buf[i] = (char *)malloc(BLOCK_SIZE);
                memset(buf[i], '\0', BLOCK_SIZE);
                i++;
        }
        return 0;
}

Output of Code 2:

time ./a.out   
**real 0m2.693s**  
user 0m0.973s  
sys 0m1.721s  

Replacing memset() with bzero(buf[i], BLOCK_SIZE) in Code 2 produces the same result.

My question is: Why is malloc+memset so much slower than calloc? How can calloc do that?

Philip Conrad
kingkai

3 Answers

486

The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.

Understanding this requires a short tour of the memory system.

Quick tour of memory

There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...

Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about is allocating a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.

The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection; it has been dirt-common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory; instead, it asks the kernel for memory using a system call like mmap() or sbrk(). The kernel gives RAM to each process by modifying the page table.

The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.

How it doesn't work

Here's how allocating 256 MiB does not work:

  1. Your process calls calloc() and asks for 256 MiB.

  2. The standard library calls mmap() and asks for 256 MiB.

  3. The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.

  4. The standard library zeroes the RAM with memset() and returns from calloc().

  5. Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.

How it actually works

The above process would work, but it just doesn't happen this way. There are three major differences.

  • When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.

  • There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in to assign RAM to those addresses and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.

  • Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.

The final process looks more like this:

  1. Your process calls calloc() and asks for 256 MiB.

  2. The standard library calls mmap() and asks for 256 MiB.

  3. The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.

  4. The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.

  5. Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.

If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and explains why calloc() is faster than malloc() and memset(). If you end up using the memory anyway, calloc() is still faster than malloc() and memset(), but the difference is not quite so ridiculous.


This doesn't always work

Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.

This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and freed with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.

Dispelling some wrong answers

Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and DragonFly BSD recently removed this feature from their kernel as well. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle time isn't enough to explain the large performance differences anyway.

The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:

function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1

As you can see, memset() is very fast, and you're not really going to get anything better for large blocks of memory.

The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).

Party trick

Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.

What happens if you add memset()?

Dietrich Epp
  • 7
@Dietrich: the virtual-memory explanation — the OS handing out the same zero-filled page many times for calloc — is easy to check. Just add a loop that writes junk data into every allocated memory page (writing one byte every 500 bytes should be enough). The overall results should then become much closer, as the system would be forced to really allocate different pages in both cases. – kriss Apr 22 '10 at 06:43
  • 1
    @kriss: indeed, although one byte every 4096 is sufficient on the vast majority of systems – Dietrich Epp Apr 22 '10 at 06:46
  • Actually, `calloc()` is often part of the [`malloc`](https://www.mirbsd.org/man3/malloc) implementation suite, and thus optimised to _not_ call `bzero` when getting memory from `mmap`. – mirabilos Mar 31 '14 at 20:33
  • 1
    @mirabilos: Actually, implementations tend to be even more sophisticated. Memory allocated by `mmap()` is allocated in large chunks, so the `malloc()` / `calloc()` implementation may keep track of what blocks are still pristine and full of zeroes. So `calloc()` can avoid touching memory even if it doesn't get the memory from `mmap()`, i.e., it was already part of the heap but hasn't been used yet. – Dietrich Epp Mar 31 '14 at 20:49
  • @DietrichEpp: yes, some implementations may do that. However, this is error-prone, so some implementations such as omalloc from OpenBSD choose not to. This is a speed optimisation but comes at a heavy maintenance (and code audit) burden. – mirabilos Mar 31 '14 at 20:57
  • 1
    @mirabilos: I've also seen implementations with a "high water mark", where addresses beyond a certain point are zeroed. I'm not sure what you mean by "error-prone"—if you are worried about applications writing to unallocated memory, then there is very little you can do to prevent insidious errors, short of instrumenting the program with mudflap. – Dietrich Epp Mar 31 '14 at 21:24
I'm sorry to necro this old answer, but people are linking to it claiming it justifies a blanket statement about calloc being faster than malloc + memset. calloc is *sometimes* going to be faster. Your answer does not explicitly state how allocating memory works (e.g. calloc has a chance to be faster only if a new page is allocated and is not going to be touched, or the buffer to be returned was never used before) and leaves beginners with the wrong impression. It also states that the zero page is used here and the kernel can possibly skip mapping a page. continued in the next comment – employee of the month Jul 26 '16 at 16:40
On mmap the kernel will NOT map anything. Blindly putting in the zero page and having to replace it later with a different page would be a performance hit. When executing the benchmark posted by OP, the zero page is not used. As you noted, the vast majority of pages are simply not accessed here, and this results in very few mappings being populated. Could you please clarify your post, in particular explicitly state when calloc *can* be faster, and also state that doing malloc + memset of the whole buffer is wrong, and defaulting to memsetting everything is also wrong? thanks – employee of the month Jul 26 '16 at 16:44
  • @employeeofthemonth: Originally this was just an answer to explain what the OP observed, not an attempt to explain the differences between `malloc()` and `calloc()` in general. However, I believe I have addressed the main points of your comment already: "When you allocate a large enough region of memory…" is the qualifier for when these tricks that make `calloc()` faster are used, and the parenthetical "Actually, the kernel can…" explains that `mmap()` does not necessarily map the memory, as you say. – Dietrich Epp Jul 26 '16 at 17:47
  • @employeeofthemonth: I typically revise answers to expand, clarify, and refine the wording once they reach a reasonable size audience like this one. Since you brought it to my attention, I'll probably do that here in the next fortnight or so, but there is only so much you can do to stop people from misinterpreting things. – Dietrich Epp Jul 26 '16 at 17:51
  • @employeeofthemonth: I think this revision should cover it. – Dietrich Epp Aug 05 '16 at 07:30
  • 1
    Thank you for editing, that's almost what I had in mind. Early you state to always use calloc instead of malloc + memset. Please state to 1. default to malloc 2. if a small part of the buffer needs to be zeroed, memset that part 3. otherwise use calloc. In particular DO NOT malloc + memset the whole size (use calloc for that) and DO NOT default to callocing everything as it hinders things like valgrind and static code analysers (all memory is suddenly initialized). Other than that I think this is fine. – employee of the month Aug 05 '16 at 21:37
  • @employeeofthemonth: No. Those suggestions are out of scope. The question is about comparing `malloc()+memset()` to `calloc()`. – Dietrich Epp Aug 05 '16 at 21:43
  • This is a minor point of course. I would agree if it was not for people who take 'calloc being faster' as a justification. I find it on topic enough to mention. If you don't want to note it here, can you please edit the section in docs ( http://stackoverflow.com/documentation/c/4726/memory-management/3632/allocating-memory#t=201608052301431671887 ) which makes a blanket statement about calloc being faster and recommends its use? I requested an improvement stating the better usage and the fact that calloc only sometims is faster. It got "fixed" by saying some calloc implementations are faster. – employee of the month Aug 05 '16 at 23:05
  • But I'm not going to insist. Thank you once more for addressing major points of my feedback. – employee of the month Aug 05 '16 at 23:05
  • 7
    Whilst not speed related, `calloc` is also less bug prone. That is, where `large_int * large_int` would result in an overflow, `calloc(large_int, large_int)` returns `NULL`, but `malloc(large_int * large_int)` is undefined behaviour, as you don't know the actual size of the memory block being returned. – Dunes Mar 23 '18 at 09:41
  • Will compilers optimize the `malloc()` + `memset()` to one `calloc()`? – Chayim Friedman Sep 17 '20 at 00:54
  • @ChayimFriedman: Test it out in Godbolt. https://gcc.godbolt.org/ – Dietrich Epp Sep 17 '20 at 01:11
14

Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.

Chris Lutz
  • 2
Are you sure? Which systems do this? I thought that most OSs just shut down the processor when they were idle, and zeroed memory on demand, as soon as the allocating process first wrote to that memory (but not when it allocated it). – Dietrich Epp Apr 22 '10 at 06:00
  • @Dietrich - Not sure. I heard it once and it seemed like a reasonable (and reasonably simple) way to make `calloc()` more efficient. – Chris Lutz Apr 22 '10 at 06:06
  • @Pierreten - I can't find any good info on `calloc()`-specific optimizations and I don't feel like interpreting libc source code for the OP. Can you look up anything to show that this optimization doesn't exist / doesn't work? – Chris Lutz Apr 22 '10 at 06:13
@Chris: The code you're looking for is not in the libc code but in the kernel. Libc just calls sbrk or mmap, and it knows that the kernel only hands out zeroed memory from those two syscalls, so it doesn't zero that memory again. – Dietrich Epp Apr 22 '10 at 06:35
  • 13
    @Dietrich: FreeBSD is supposed to zero-fill pages in idle time: See its vm.idlezero_enable setting. – Zan Lynx Mar 07 '11 at 21:47
  • @ZanLynx `vm.idlezero_enable` is interesting by itself, but it maybe just to prevent information leaks. Is there an API to obtain this zero memory for the *libc* `calloc()`? I guess a static *read-only* zero page is enough; a write fault swaps them with the zero pool. I guess the other part of the pool is for disk buffers? – artless noise Dec 22 '13 at 18:45
  • 1
    @DietrichEpp sorry to necro, but for example Windows does this. – Andreas Grapentin Nov 11 '14 at 19:37
  • calloc() on Windows ~= HeapAlloc(crtheap, HEAP_ZERO_MEMORY, size) – KindDragon Aug 18 '15 at 16:03
@DietrichEpp Mark Russinovich has an amazing talk on how Windows's virtual memory system works, including details about the job that constantly zeroes memory pages (may be in Part 2, don't recall): https://www.youtube.com/watch?v=TrFEgHr72Yg – Dan Bechard May 08 '20 at 06:53
2

On some platforms, in some modes, malloc() initialises the memory to some typically non-zero value before returning it, so the second version could well initialise the memory twice.

Stewart