
The official NumPy docs say

Return a new array of given shape and type, without initializing entries.

for np.empty, which would imply that the time taken to create (allocate) this array would be O(1), but some simple testing with timeit shows this is not the case:

>>> timeit.timeit(lambda: np.empty(100000000 ), number=10000)
0.2733485999999914
>>> timeit.timeit(lambda: np.empty(1000000000), number=10000)
0.8293009999999867

As a side question, what are the values present in an untouched np.empty array? They're all really small values, but I would expect them to just be whatever values were already present in memory at that address. (Example array: np.empty(2) = array([-6.42940774e-036, 2.07409447e-117]). These don't seem like anything that would be stored in memory.)

Recessive
  • With arrays this large you can run into complexities of memory management. – hpaulj Apr 21 '21 at 07:08
  • Why do you think the numbers you observe do not represent “whatever values are present in memory at that address”? What makes those small values less likely than the random numbers you expected? – Holger Apr 21 '21 at 07:49
  • @Holger I guess because they are consistently small numbers; I would expect them to be random integer numbers of 4 bytes or so. I'm sure it's *some* representation of the memory, it just seems pretty removed from binary data – Recessive Apr 21 '21 at 09:58
  • Why should interpreting random data as float values result in integer numbers? – Holger Apr 21 '21 at 10:46
  • Reinterpreting small 64-bit integer bit patterns as 64-bit floats, e.g. `np.arange(10, dtype=np.uint64).view('float64')`, also gives very small values (see the sketch below). – Jérôme Richard Apr 21 '21 at 12:56
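
As a minimal sketch of the point made in the last comments (assuming NumPy is installed; `view` reinterprets the same bytes under another dtype without copying), small integer bit patterns decode to tiny, often denormal, float64 values:

import numpy as np

bits = np.arange(10, dtype=np.uint64)   # small 64-bit integer bit patterns
print(bits.view(np.float64))            # tiny values, roughly 5e-324 .. 4.4e-323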

1 Answer


First of all, I tried to reproduce this behaviour on my machine with various sizes. Here are the raw results:

np.empty(10**1)   # 421 ns ± 23.7 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**2)   # 406 ns ± 1.44 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**3)   # 471 ns ± 5.8 ns per loop     (on 7 runs, 1000000 loops each)
np.empty(10**4)   # 616 ns ± 1.56 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**5)   # 620 ns ± 2.83 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**6)   # 9.61 µs ± 34.2 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**7)   # 11.1 µs ± 17.6 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**8)   # 22.1 µs ± 173 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**9)   # 62.8 µs ± 220 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**10)  # => Memory Error

Thus, you are right: this is not done in O(1) (at least not on my Windows machine, nor on your system). Note that the values cannot be (eagerly) initialized in such a small time: writing the 8 GB of a 10**9-element float64 array in ~63 µs would imply a RAM throughput of more than 127 TB/s, which I clearly do not have on my machine.

for np.empty, which would imply that the time taken to create (allocate) this array would be O(1)

The hypothesis that allocations are done in O(1) is not entirely true. To check that, I wrote a simple C program that runs a malloc+free loop and measured the timings. Here are the raw results:

./malloc.exe 10           # Average time:  41.815 ns (on 1 run, 1000000 loops each)
./malloc.exe 100          # Average time:  45.295 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000         # Average time:  47.400 ns (on 1 run, 1000000 loops each)
./malloc.exe 10000        # Average time: 122.457 ns (on 1 run, 1000000 loops each)
./malloc.exe 100000       # Average time: 123.032 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000000      # Average time:   8.351 us (on 1 run, 1000000 loops each)
./malloc.exe 10000000     # Average time:   9.342 us (on 1 run, 100000 loops each)
./malloc.exe 100000000    # Average time:  18.972 us (on 1 run, 10000 loops each)
./malloc.exe 1000000000   # Average time:  64.527 us (on 1 run, 10000 loops each)
./malloc.exe 10000000000  # => Memory error
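
For reference, a rough Python analogue of that malloc+free loop can be written with ctypes (this is only a sketch, assuming ctypes can locate your C runtime, and it is not the exact C program used for the numbers above; the ctypes call overhead is included in the timings):

import ctypes
import ctypes.util
import timeit

# Load the C runtime and declare the allocator signatures.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "msvcrt")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

def bench(nbytes, loops=10_000):
    # Time one malloc immediately followed by the matching free.
    total = timeit.timeit(lambda: libc.free(libc.malloc(nbytes)), number=loops)
    return total / loops

for size in (10, 10**3, 10**5, 10**7, 10**9):
    print(f"malloc+free of {size:>10} bytes: {bench(size) * 1e9:10.1f} ns")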

As you can see, the results match those of Numpy (except for the small sizes, where the difference is due to the overhead of calling a Python function in CPython). Thus, the behaviour does not come from Numpy but from the allocation algorithm of the standard libc, or from the OS itself.

As a side question, what are the values present in an untouched np.empty array?

It is uninitialized data. In practice, it is often zero-filled (but not always) because mainstream platforms sanitize memory newly obtained from the OS for security reasons (so that critical data like passwords previously stored in the memory of another process do not leak). You should not rely on this.
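
As a quick illustration (the output is platform- and run-dependent, so treat it as an observation, not a guarantee), a small np.empty array typically reuses already-touched heap memory while a very large one is usually backed by freshly mapped, zeroed pages from the OS:

import numpy as np

small = np.empty(4)       # often recycled heap memory -> arbitrary bit patterns
large = np.empty(10**7)   # often fresh pages from the OS -> frequently all zeros

print("small:", small)
print("large is all zero?", not large.any())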


Deeper explanation of the malloc timings:

As you can see, there is a gap between the allocation of 100K items and 1M items. This can be explained by the use of a fast user-space allocator: for small requests, the libc of most mainstream platforms does not directly request memory from the operating system. It rather serves them from fast pre-allocated local memory pools (grown with calls like sbrk/brk on Unix and Linux systems). Actually, on most mainstream platforms, multiple pools of different sizes are pre-allocated and the libc picks the "right one" depending on the requested size, hence the timing variation for small sizes. Note that this strategy is used to improve allocation speed while keeping memory fragmentation under control. It is much faster because kernel calls (like mmap) are very expensive (they take at least several microseconds on my machine).

Moreover, most operating systems (OS) have what amounts to multiple memory pools of their own. Linux, macOS and Windows split virtual memory into small pages (typically 4 KB). Since working with such small pages introduces a significant overhead when dealing with GB/TB of allocated data, these OSes also provide bigger pages, called super-pages or huge pages (typically 2 MB to a few GB). The path taken inside the OS can change depending on the amount of memory being allocated, and most OSes are optimized for allocating small chunks of virtual memory rather than huge ones.

Note that the size of the data structures used to manage the system memory is often bounded by the size of the RAM, which is generally constant at runtime. Moreover, the complexity of the algorithm used in a given OS to manage memory fragmentation may theoretically be O(1) (or close to that). As a result, some people argue that allocating/freeing data is done in constant time. But this is controversial, because one should take practical measurements into account and not just theoretical asymptotic bounds.



Jérôme Richard