
I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

jbl
  • This is an interesting read: https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37 In the experiments `mmap()` is 2-6 times faster than using syscalls, e.g. `read()`. – mplattner May 02 '20 at 23:42

12 Answers


I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

  • A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
  • The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.

However,

  • Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
  • Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
  • Reading a file directly is very simple and fast.

The discussion of mmap/read reminds me of two other performance discussions:

  • Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.

  • Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.

(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)

Dietrich Epp
  • Keep in mind that using any advice based on hardware and software from the 2000s without testing it today would be a very suspect approach. Also, while many of the facts about `mmap` vs `read()` in that thread are still true as they were in the past, the overall performance can't really be determined by adding up the pros and cons, but only by testing on a particular hardware configuration. For example, it is debatable that "A call to mmap has more overhead than read" - yes `mmap` has to add mappings to the process page table, but `read` has to copy all read bytes from kernel to user space. – BeeOnRope May 26 '18 at 02:59
  • The upshot is that, on my (modern Intel, circa 2018) hardware, `mmap` has lower overhead than `read` for larger-than-page-sized (4 KiB) reads. Now it's very true that if you want to access data sparsely and randomly, `mmap` is really, really good - but the converse isn't necessarily true: `mmap` may still be the best for sequential access as well. – BeeOnRope May 26 '18 at 03:00
  • @BeeOnRope: You may be skeptical of advice based on hardware and software from the 2000s, but I am even more skeptical of benchmarks that don't provide a methodology and data. If you would like to make a case that `mmap` is faster, I would expect to see at a bare minimum the entire testing apparatus (source code) with the tabulated results, and the processor model number. – Dietrich Epp May 26 '18 at 03:08
  • @BeeOnRope: Also keep in mind that when you are testing bits of the memory system like this, microbenchmarks can be extremely deceptive because a TLB flush can negatively impact the performance of the rest of your program, and this impact won't show up if you only measure the mmap itself. – Dietrich Epp May 26 '18 at 03:12
  • I'm not trying to make a claim that people should accept my results at face value over those presented in the linked thread. I'm saying _neither_ is sufficient: people should test it on their own system rather than simply accepting the results from that thread (which also didn't provide a detailed methodology or data). I provided my results mostly to point out that my results today are the opposite of Paul's back then, as part of a plea for people to test it locally. Non-quantitative arguments are really only convincing when one solution dominates the other, and this is not the case here. – BeeOnRope May 26 '18 at 03:16
  • @DietrichEpp - yes, I'm well versed in TLB effects. Note that `mmap` does not flush the TLB except in unusual circumstances (but `munmap` might). My tests included both microbenchmarks (including `munmap`) _and_ also "in application" running in a real-world use case. Of course my application is not the same as your application, so people should test locally. It isn't even clear that `mmap` is favored by a micro-benchmark: `read()` also gets a big boost since the user-side destination buffer generally stays in L1, which may not happen in a larger application. So yeah, "it's complicated". – BeeOnRope May 26 '18 at 03:20
  • I really wonder how that aligns with new byte-addressable persistent memory (Optane DCPMM for instance). – claf Oct 08 '19 at 14:50

There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.

mmap seems like magic

Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:

  1. mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
  2. mmap doesn't require a copy of the file data from kernel to user-space.
  3. mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.

In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.

Well, it can.

mmap is not actually magic because...

mmap still does per-page work

A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page in user-space, even though it might be hidden by the page-fault mechanism.

For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.

mmap relies heavily on TLB performance

Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
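For concreteness, a minimal sketch of what a pre-faulted mapping looks like on Linux (assuming a file small enough to populate fully; the file path and wrapper function are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void scan_with_populate(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);                     // st.st_size is the file length

    // MAP_POPULATE asks the kernel to build all the page-table entries up
    // front (and pulls the whole file into the page cache), so later
    // accesses avoid minor faults.
    char *base = static_cast<char *>(
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0));
    if (base != MAP_FAILED) {
        // ... scan base[0 .. st.st_size) as ordinary memory ...
        munmap(base, st.st_size);
    }
    close(fd);
}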

Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.

Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching hierarchy performs (b) how well hardware prefetch deals with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!

read() avoids these pitfalls

The read() syscall, which is what generally underlies the "block read" type calls offered, e.g., in C, C++ and other languages, has one primary disadvantage that everyone is well aware of:

  • Every read() call of N bytes must copy N bytes from kernel to user space.

On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
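For comparison, a minimal sketch of the read() pattern described above - one small, reused buffer and one copy per call (the path and function name are placeholders):

#include <fcntl.h>
#include <unistd.h>

void scan_with_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    char buf[64 * 1024];          // one small buffer, reused for every read() call
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        // process buf[0 .. n): the kernel copied n bytes into our buffer
    }
    close(fd);
}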

So basically you have the following comparison to determine which is faster for a single read of a large file:

Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?

On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.

In particular, the mmap approach becomes relatively faster when:

  • The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
  • The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
  • The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.

... while the read() approach becomes relatively faster when:

  • The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
  • The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
  • The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.

The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).

The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

  • Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
  • Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really help the read() case.

Update after Spectre and Meltdown

The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.

All of this is a relative disadvantage for read()-based methods as compared to mmap-based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, since using large buffers usually performs worse: you exceed the L1 size and hence are constantly suffering cache misses.

On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.


1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.

2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
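For what it's worth, these "advise" hints are one-liners; a minimal sketch, assuming POSIX:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>

void advise_sequential(int fd, void *addr, size_t len)
{
    // For read()-style access: tell the kernel the file will be scanned
    // sequentially so it can read ahead more aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // The equivalent hint for an mmap'd region.
    madvise(addr, len, MADV_SEQUENTIAL);
}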

3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.

4 In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.

BeeOnRope
  • Thank you for providing a more nuanced answer to this complex issue. It seems obvious to most people that mmap is faster, when in reality it is often not the case. In my experiments, randomly accessing a large 100GB database with an in-memory index turned out to be faster with pread(), even though I was malloc'ing a buffer for each of the millions of accesses. And it seems like a lot of people in the industry [have observed the same](https://github.com/facebook/rocksdb/issues/507). – Caetano Sauer May 03 '17 at 09:03
  • Yeah, it depends a lot on the scenario. If your reads are _small enough_ and over time you tend to repeatedly read the same bytes, `mmap` will have an insurmountable advantage since it avoids the fixed kernel call overhead. On the other hand, `mmap` also increases TLB pressure, and may actually be slower for the "warm up" phase where bytes are being read for the first time in the current process (although they are still in the page cache), since it may do more work than `read`, for example to "fault-around" adjacent pages... and for some applications "warm up" is all that matters! @CaetanoSauer – BeeOnRope May 03 '17 at 19:12
  • I think where you say "...but 25 billion page faults is still not going to be super fast..." it should read "...but 25 *million* page faults is still not going to be super fast...". I am not 100% positive, so that is why I am not editing directly. – Ton van den Heuvel Mar 27 '19 at 06:57

The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.

I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.

I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....

In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:

  1. There is random access (not sequential) within the file, AND
  2. the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
  3. OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.

(btw - I love mmap()/MapViewOfFile()).

Tim Cooper
  • Good point about random access: this might be one of the things driving my perception. – Ben Collins May 19 '11 at 15:40
  • I wouldn't say the file has to comfortably fit into memory, only into address space. So on 64bit systems, there should be no reason not to map huge files. The OS knows how to handle that; it's the same logic used for swapping but in this case doesn't require additional swap space on disk. – MvG May 02 '14 at 21:23
  • @MvG : Do you understand the point about disk i/o? If the file fits into address space but not memory and you have random access then you could have every record access requiring a disk head move and seek, or an SSD page operation, which would be a disaster for performance. – Tim Cooper May 03 '14 at 21:45
  • The disk i/o aspect should be independent from the access method. If you have truly random access to larger-than-RAM files, both mmap and seek+read are severely disk-bound. Otherwise both will benefit from the caches. I don't see file size compared to memory size as a strong argument in either direction. File size vs. address space, on the other hand, is a very strong argument, particularly for truly random access. – MvG May 03 '14 at 22:07
  • My original answer had and has this point: "the whole thing fits comfortably in memory OR there is locality-of-reference within the file ". So the 2nd point addresses what you're saying. – Tim Cooper May 05 '14 at 00:52

mmap is way faster. You might write a simple benchmark to prove it to yourself:

#include <fstream>

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
  in.read(data, 0x1000);
  // do something with data (in.gcount() bytes are valid after a short final read)
}

versus:

#include <fcntl.h>
#include <sys/mman.h>

const size_t file_size=something;  // e.g., obtained via stat(); int would overflow for huge files
const size_t page_size=0x1000;
size_t off=0;
void *data;

int fd = open("filename.bin", O_RDONLY);

while (off < file_size)
{
  // the flags argument must include MAP_PRIVATE or MAP_SHARED; 0 is invalid
  data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
  // do stuff with data
  munmap(data, page_size);
  off += page_size;
}

Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.

If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).

A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(

Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.

Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.

Edit (to clean up the answer list), in reply to @jbl:

the sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).

Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by deriving a mapped_filebuf from std::filebuf and, similarly, a mapped_fstream from std::fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.

Ben Collins
  • RE: mmaped file cache on Windows. Exactly: when file buffering is enabled, the kernel memory maps the file you're reading internally, reads into that buffer and copies it back into your process. It's as if you memory mapped it yourself except with an extra copy step. – Chris Smith Oct 26 '08 at 21:53
  • I'm loath to disagree with an accepted answer, but I believe this answer is wrong. I followed your suggestion and tried your code, on a 64bit Linux machine, and mmap() was no faster than the STL implementation. Also, theoretically I would not expect 'mmap()' to be any faster (or slower). – Tim Cooper Dec 08 '08 at 04:07
  • @Tim Cooper: you may find this thread (http://markmail.org/message/zbxnldcz6dlgo5of#query:mmap%20vs%20blocks+page:1+mid:zbxnldcz6dlgo5of+state:results) of interest. Note two things: mmap isn't properly optimized in Linux, and one also needs to use madvise in their test to get the best results. – Ben Collins Dec 08 '08 at 11:40
  • FWIW, I also played around reading chunks off disk (e.g., 4096 bytes at a time) and didn't see any improvement from, essentially, reading a record at a time. I saw another relevant post somewhere on mmap() vs. read() in mmap()'s favor. I'll try and dig it up. – jbl Dec 08 '08 at 15:40
  • Dear Ben: I've read that link. If 'mmap()' is not faster on Linux, and MapViewOfFile() is not faster on Windows, then can you make the claim that "mmap is way faster"? Also, for theoretical reasons I believe mmap() is no faster for sequential reads - do you have any explanation to the contrary? – Tim Cooper Dec 08 '08 at 23:11
  • Here's a link to someone who's done the copy file test with a variety of methods. http://lkml.org/lkml/2008/1/14/491. I think I probably need to write the mmap version of my code to fully resolve this question. I was hoping not to have to! – jbl Dec 09 '08 at 18:30
  • There seems to be evidence going both ways, so I'm going to unaccept this until there's some more definitive evidence from me or others. – jbl Jan 13 '09 at 04:04
  • More details on my tests to disprove Ben's assertion: I ran 3 programs, Ben's 2 programs plus a version using C-style open/read. I got these results: open/read: 7m42s, ifstream: 8m27s, mmap: 8m08s on a 2-year old Linux machine with 2.5Gb RAM, 4.9Gb datafile, calculating a simple xor checksum. – Tim Cooper Jan 13 '09 at 06:11
  • @Tim Cooper: My tests are probably not good benchmarks, because mmapping a single page and reading one page's worth of data at a time using istream.read() likely amount to the same operations under the hood. The reason for my assertion is because of experience in trying to do it both ways. For example, I've written code in which I tried to open very large files for reading, and found that mmap was /way/ faster just as I claim. A better benchmark would test different page sizes, different file sizes, different loop constructs, etc. In the end, I think most would find mmap to be faster. – Ben Collins Oct 26 '10 at 22:31
  • Ben, why bother to `mmap()` the file a page at a time? If a `size_t` is capacious enough to hold the size of the file (very likely on 64-bit systems), then just `mmap()` the entire file in one call. – Steve Emmerson Mar 25 '11 at 18:06
  • Came to this post having faced a similar problem myself, trying to use mmaped file access for sequentially processing a large file. Even using madvise with POSIX_MADV_SEQUENTIAL there is no improvement whatsoever – juhanic May 19 '11 at 08:40
  • @juhanic: Are you mmapping the entire file? The performance benefit in part comes from not making repeated system calls. – Joseph Garvin Aug 10 '11 at 01:28
  • @BenCollins: `"Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true"` - and also completely beside the point. The Q. is titled `mmap() vs. reading blocks`, so no you can't argue that _mmap is (often?) way faster_ – sehe Sep 27 '11 at 12:28
  • Are you by chance forgetting to call `ios_base::sync_with_stdio(false)`? – Alexei Averchenko Jun 29 '15 at 05:15

I'm sorry Ben Collins lost his sliding-window mmap source code. That'd be nice to have in Boost.

Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.

The sliding window approach really isn't that difficult as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.

If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it, and move on to the next.
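A minimal sketch of that book-keeping, assuming POSIX (the helper name is made up): map the smallest page-aligned window covering one record, hand back a pointer to the record inside it, and munmap() the window when finished:

#include <cstddef>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Map the smallest page-aligned window covering [rec_off, rec_off + rec_len)
// and return a pointer to the record inside it. The caller later calls
// munmap(*map_base, *map_len).
const char *map_record(int fd, off_t rec_off, size_t rec_len,
                       void **map_base, size_t *map_len)
{
    const off_t page = getpagesize();            // mapping offsets must be page-aligned
    off_t  start = rec_off - (rec_off % page);   // round down to a page boundary
    size_t len   = (size_t)(rec_off - start) + rec_len;
    len = (len + page - 1) / page * page;        // round up to a whole number of pages

    void *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, start);
    if (base == MAP_FAILED)
        return NULL;

    *map_base = base;
    *map_len  = len;
    return static_cast<const char *>(base) + (rec_off - start);  // points at the record
}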

This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

mlbrock
  • I just googled and found this little snippet about dwAllocationGranularity -- I was using dwPageSize and everything was breaking. Thanks! – wickedchicken Nov 10 '10 at 06:58

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.

Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it in one pass and then be done with it, that could potentially be much faster.

Leon Timmermans

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counter example be somewhat optimized?

Ben Collins wrote:

char data[0x1000];
std::ifstream in("file.bin");

while (in)
{
    in.read(data, 0x1000);
    // do something with data 
}

I would suggest also trying:

char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream  in( ifile.rdbuf() );

while( in )
{
    in.read( data, 0x1000);
    // do something with data
}

And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
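For instance, a minimal sketch of sizing the buffer from the actual page size at run time instead of hard-coding 0x1000 (the wrapper function is a placeholder):

#include <fstream>
#include <unistd.h>
#include <vector>

void scan_with_page_sized_buffer()
{
    const long page_size = sysconf(_SC_PAGESIZE);   // the real VM page size on this machine
    std::vector<char> data(page_size);
    std::ifstream in("file.bin");

    while (in.read(data.data(), data.size()) || in.gcount() > 0)
    {
        // do something with data.data(); in.gcount() bytes are valid
    }
}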

paxos1977

Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).

Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?

Douglas Leeder

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization which involves a lot of work in memory, like allocating tree nodes and setting pointers. So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructor calls. For such kind of task, mmap is unbeatable compared to de-serialization. Of course one should look into Boost's relocatable pointers for this.
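For reference, a tiny sketch of what such a relocatable layout can look like with boost::interprocess::offset_ptr (the node layout here is made up, not the original code):

#include <boost/interprocess/offset_ptr.hpp>

// Nodes stored inside the mapped file hold self-relative offsets instead of
// raw addresses, so the tree stays valid wherever the file happens to be mapped.
struct Node {
    int key;
    boost::interprocess::offset_ptr<Node> left;
    boost::interprocess::offset_ptr<Node> right;
};

// After mmap'ing the file, the root can be used in place, e.g.:
//   Node *root = reinterpret_cast<Node *>(static_cast<char *>(base) + root_offset);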

  • That sounds more like a recipe for disaster. What do you do if the object layout changes? If you have virtual functions, all the vftbl pointers will probably be wrong. How do you control where the file is mapped to? You can give it an address, but it's only a hint and the kernel may choose another base address. – Jens Apr 24 '16 at 08:42
  • This works perfectly when you have a stable and clearly defined tree layout. Then you can cast everything to your relevant structs and follow the internal file pointers by adding an offset of "mmap start address" each time. This is very similar to file systems using inodes and directory trees – Mike76 Jan 19 '17 at 23:44

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

mike
  • Most likely you can do a better job of caching your application-specific data than the kernel can, which operates on page sized chunks in a very blind way (e.g., it only uses a simple pseudo-LRU scheme to decide which pages to evict) - while you may know a lot about the right caching granularity and also have a good idea of future access patterns. The real benefit of `mmap` for caching is that you simply _re-use_ the existing page cache which is already going to be there, so you get that memory for free, and it can be shared across processes too. – BeeOnRope Jan 02 '17 at 16:44

I think the greatest thing about mmap is the potential for asynchronous reading with:

    /* fd is the open file descriptor; size_left, pos, ctx, MMAP_SIZE and
       MAP_FLAGS are assumed to be set up elsewhere */
    addr1 = NULL;
    while( size_left > 0 ) {
        r = min(MMAP_SIZE, size_left);
        addr2 = mmap(NULL, r,
            PROT_READ, MAP_FLAGS,
            fd, pos);
        if (addr1 != NULL)
        {
            /* process mmap from prev cycle */
            feed_data(ctx, addr1, MMAP_SIZE);
            munmap(addr1, MMAP_SIZE);
        }
        addr1 = addr2;
        size_left -= r;
        pos += r;
    }
    if (addr1 != NULL)
    {
        /* process the final (possibly short) window */
        feed_data(ctx, addr1, r);
        munmap(addr1, r);
    }

Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from file asap. I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before returning from the call, but will do that asynchronously with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.

ony
  • You want [`posix_madvise` with the `WILLNEED`](http://linux.die.net/man/3/posix_madvise) flag for lazy hints to prepopulate. – ShadowRanger Nov 05 '16 at 03:41
  • @ShadowRanger, sounds reasonable. Though I'd update the man page to clearly state that `posix_madvise` is an async call. It would also be nice to reference `mlock` for those who want to wait until the whole memory region becomes available without page faults. – ony Nov 07 '16 at 13:32

This sounds like a good use-case for multi-threading... I'd think you could pretty easily set up one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.

Pat Notz
  • Yep. I have been thinking about that and will probably try it out in a later release. The only reservation I have is that the processing is far shorter than the I/O latency, so there may not be much benefit. – jbl Sep 06 '08 at 19:15