
In my case I have different files; let's assume that I have a >4GB file with data. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (minimum 4GB). You can also assume that processing these lines isn't the bottleneck.

In my current solution I read the file with an ifstream and copy each line into a string. Here is a snippet of how it looks:

#include <fstream>
#include <string>
#include <cstdint>

// filename_xml holds the path to the input file
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}

And OK, that's working, but too slowly. Here is the time it takes for my 3.6 GB of data:

real    1m4.155s
user    0m0.000s
sys     0m0.030s

I'm looking for a method that will be much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the presented solution with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a file 1 GB large was enough to bring the whole process down. I have to care about how much data is currently in memory; the people who will be using this tool probably don't have more than 4 GB of RAM installed.

So I found mapped_file from Boost, but how do I use it in my case? Is it possible to read that file partially and still receive its lines?

Maybe you have another, much better solution. I just have to process each line.

Thanks,
Bart

bioky
  • You can map only part of the file. – clcto Aug 05 '14 at 22:02
  • Memmapping will map the ENTIRE file into the memory space. That's impossible, since your file would suck up the entire addressable space of the process. You'd need to "window" the file, so only smaller parts of it are visible at any given time via your memmap area. – Marc B Aug 05 '14 at 22:04
  • What are the stats if you run it twice in a row? The fact that there's almost zero user or sys time implies that most of the time is spent doing I/O. Going to memory mapped files won't improve the speed (since that data needs to get paged in) unless you have enough memory to cache the whole file. – MSN Aug 05 '14 at 22:06
  • @MarcB Yeah, I was thinking about that, but how do I map smaller chunks of the file? – bioky Aug 05 '14 at 22:15
  • @MSN I got the following results: 'real 0m51.170s user 0m0.000s sys 0m0.000s' and 'real 0m50.100s user 0m0.000s sys 0m0.000s' – bioky Aug 05 '14 at 22:17
  • How long are the longest lines? – Fors Aug 05 '14 at 22:19
  • A typical 7200 rpm spindle disk drive can read at 60 MB/sec at best. 3.6 GB takes 1 minute to read, no matter what kind of code you write. You'll need a faster disk or stop waiting for the program to finish. – Hans Passant Aug 05 '14 at 22:30
  • @Fors The longest line has 70 bytes, but the lines have different sizes. – bioky Aug 05 '14 at 23:08
  • @HansPassant I agree, but here I'm a little confused: an example that counts newlines but is based on mapped_file took around 5s to give a valid response when run on a smaller file, so here is the awkward moment: how is that possible? – bioky Aug 05 '14 at 23:11
  • @bioky: You say you agree, but then go ahead and ask for something that he just said is physically impossible. The example source ran faster because it ran _on a smaller file_. – Mooing Duck Aug 06 '14 at 00:41
  • @bioky - it will be a lot faster when the disk is not used but the data comes from the file system cache and the machine has sufficient RAM. A typical benchmark hazard when you run a program repeatedly. – Hans Passant Aug 06 '14 at 07:20
  • @MooingDuck Yeah, you are right, but it was correlated with the fact that the same file read with ifstream took around 10s; that is probably related to copying the content into memory, etc. – bioky Aug 06 '14 at 09:36
  • @bioky: For small files, it also depends on if the file has been used "recently". Do each test multiple times and it should be about the same. – Mooing Duck Aug 06 '14 at 16:22

4 Answers


Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?

It seems you're really looking for the fastest way to count lines (or do any linear single-pass analysis). I've done a similar analysis and benchmark of exactly that here.

Interestingly, you'll see there that the most performant code does not need to rely on memory mapping at all.

#include <fcntl.h>      // open, O_RDONLY, posix_fadvise
#include <unistd.h>     // read
#include <cstring>      // memchr
#include <cstdint>      // uintmax_t

// handle_error() is a small helper from the linked benchmark
// (report the error, e.g. with perror(), and exit).

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        // count every '\n' in the chunk that was just read
        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
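
For completeness, a hypothetical way to call it (this main() is not part of the benchmark, just an assumed driver):

#include <iostream>

int main(int argc, char** argv)
{
    if (argc < 2) { std::cerr << "usage: wc <file>\n"; return 1; }
    std::cout << wc(argv[1]) << " lines\n";
}
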
sehe

A 64-bit system with a small amount of memory should be fine for loading a large file into - it's all about address space - although it may well be slower than the "fastest" option in that case; it really depends on what else is in memory and how much of the memory is available for mapping the file into. On a 32-bit system it won't work, since the pointers into the file mapping won't go beyond about 3.5 GB at the very most - and typically around 2 GB is the maximum - again, depending on what memory addresses are available to the OS to map the file into.

However, the benefit of memory mapping a file is pretty small - the huge majority of the time spent comes from actually reading the data. The saving from using memory mapping comes from not having to copy the data once it's loaded into RAM. (When using other file-reading mechanisms, the read function will copy the data into the buffer supplied, whereas memory mapping a file will put it straight into the correct location directly.)
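
To make the "windowing" idea from the comments concrete, here is a minimal sketch (an illustration, not code from this answer) that maps the file one aligned chunk at a time with boost::iostreams::mapped_file_source and counts the newlines in each view. The function name count_lines_windowed, the window size and the file_size parameter are assumptions; mapping offsets must be multiples of mapped_file_source::alignment(), and a real line parser would also have to carry a partial line across window boundaries.

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

uintmax_t count_lines_windowed(const std::string& path, uintmax_t file_size)
{
    namespace io = boost::iostreams;
    // Offsets handed to the OS must be aligned to its allocation granularity.
    const uintmax_t granularity = io::mapped_file_source::alignment();
    const uintmax_t window_size = 64 * granularity;   // a few MB per view

    uintmax_t lines = 0;
    for (uintmax_t offset = 0; offset < file_size; offset += window_size)
    {
        const uintmax_t length = (std::min)(window_size, file_size - offset);

        // Map only [offset, offset + length) of the file.
        io::mapped_file_source window(path,
                                      static_cast<std::size_t>(length),
                                      static_cast<boost::intmax_t>(offset));

        const char* p   = window.data();
        const char* end = p + window.size();
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))))
        {
            ++lines;
            ++p;
        }
    }   // the view is unmapped here before the next chunk is mapped
    return lines;
}

On a 32-bit build this keeps only a few megabytes of the file mapped at any one time, which is what the roughly 2 GB address-space limit described above forces anyway.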

Mats Petersson
  • I will try mapping with compilation on x64; maybe the system will do the rest of the job via paging. On the other hand, I think the approach with a window directly onto the file is still good. With that idea I could change some of the processing to skip a huge block of bytes in the file, which could be a little faster. What do you think? – bioky Aug 05 '14 at 23:18

You might want to look at increasing the buffer for the ifstream - the default buffer is often rather small, which leads to lots of expensive reads.

You should be able to do this using something like:

std::ifstream file;
char buffer[1024*1024];
// Set the larger buffer before opening the file; on some standard library
// implementations pubsetbuf() has no effect once the file is already open.
file.rdbuf()->pubsetbuf(buffer, 1024*1024);
file.open(filename_xml.c_str());

uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}

See this question for more info:

How to get IOStream to perform better?

Bids

Since this is Windows, you can use the native Windows file functions with the "Ex" suffix:

Windows file management functions

specifically functions like GetFileSizeEx(), SetFilePointerEx(), etc. The plain read and write functions are limited to 32-bit byte counts, and the read and write "Ex" functions are for asynchronous I/O, as opposed to handling large files.
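
As a rough sketch of that approach (CountLinesWin32 and the 1 MiB buffer are assumed names/values, not code from this answer): GetFileSizeEx() supplies the 64-bit file size, and ReadFile() is called in a loop so every individual read stays below the 32-bit byte-count limit even though the file itself is larger than 4 GB; SetFilePointerEx() would be used in the same spirit if 64-bit seeks were needed.

#include <windows.h>
#include <cstdint>
#include <vector>

uint64_t CountLinesWin32(const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return 0;                       // real code would report GetLastError()

    LARGE_INTEGER size = {};
    GetFileSizeEx(file, &size);         // 64-bit size, works for files > 4 GB

    const DWORD kBufSize = 1024 * 1024; // read in 1 MiB chunks
    std::vector<char> buf(kBufSize);
    uint64_t lines = 0;
    uint64_t remaining = static_cast<uint64_t>(size.QuadPart);

    while (remaining > 0)
    {
        DWORD toRead = static_cast<DWORD>(remaining < kBufSize ? remaining : kBufSize);
        DWORD got = 0;
        if (!ReadFile(file, buf.data(), toRead, &got, nullptr) || got == 0)
            break;
        for (DWORD i = 0; i < got; ++i) // count newlines in this chunk
            if (buf[i] == '\n')
                ++lines;
        remaining -= got;
    }

    CloseHandle(file);
    return lines;
}
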

rcgldr