69

I am currently writing a program in C++ which involves reading lots of large text files. Each has ~400,000 lines, in extreme cases with 4,000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I was wondering: is there a straightforward way to improve reading speed?

edit: The code I am using is more or less this:

string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
    while(txtFile.good())
    {
        m_numLines++;
        getline(txtFile, tmpString);
    }
    txtFile.close();
}

edit 2: The file I read is only 82 MB. I mainly mentioned that lines could reach 4,000 characters because I thought that might be relevant for buffering.

edit 3: Thank you all for your answers, but it seems like there is not much room for improvement given my problem. I have to use getline, since I want to count the number of lines. Instantiating the ifstream in binary mode didn't make reading any faster either. I will try to parallelize it as much as I can; that should work at least.

edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)

Arne
  • Using random file access or sequential? Show us your code or what you are reading. – Shumail Jul 29 '13 at 13:15
  • Depends a lot on what you're doing with it. – sehe Jul 29 '13 at 13:16
  • You might want to break it into pieces, since it seems like a memory bottleneck to me: 400,000 lines * 4,000 characters might be 1,600,000,000 characters, and probably bytes, if one character is 1 byte on your system – hetepeperfan Jul 29 '13 at 13:20
  • Question, do you use any stringstreams in your actual code? – Neil Kirk Jul 29 '13 at 13:26
  • They are slow. Just checking. – Neil Kirk Jul 29 '13 at 13:53
  • @Arne can you please tell me how much speedup you got with the parallel read, and which technique you used? – Rami Far Nov 06 '18 at 12:07
  • @RamiFar This question is 6 years old, I don't even own that laptop any more. I think it was something similar to sehe's benchmark, around 20% or so. The actual speedup was achieved by writing the program in a way that I could call `wc -l` before running the C++ code. – Arne Nov 06 '18 at 12:23

6 Answers

79

Updates: Be sure to check the (surprising) updates below the initial answer


Memory mapped files have served me well1:

#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <algorithm>  // for std::find
#include <iostream>   // for std::cout
#include <cstring>    // for memchr
#include <cstdint>    // for uintmax_t

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

This should be rather quick.

Update

In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru

#include <algorithm>
#include <iostream>
#include <cstring>    // for memchr
#include <cstdint>    // for uintmax_t
#include <cstdio>     // for perror
#include <cstdlib>    // for exit

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg); 
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}

Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken by the memory-mapped file above:

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}

1 see e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

sehe
  • @ArneRecknagel it uses [Boost Iostreams](http://www.boost.org/doc/libs/1_54_0/libs/iostreams/doc/index.html) for convenience, but you could use [`mmap` (POSIX)](http://linux.die.net/man/2/mmap) or [`MapViewOfFileEx function` (Win32)](http://msdn.microsoft.com/en-us/library/windows/desktop/aa366763(v=vs.85).aspx) if you prefer. – sehe Jul 29 '13 at 13:52
  • @ArneRecknagel I've added a version not using Boost in case it helps. **[See it live on Coliru](http://coliru.stacked-crooked.com/view?id=0a8e265dd1295e3b30607cc83bd7a7d3-e223fd4a885a77b520bbfe69dda8fb91)** (counting the lines in its own `main.cpp`) – sehe Jul 29 '13 at 14:06
  • @ArneRecknagel I've updated my code after benchmarking with an 8.9 GiB file. It turned out that using `memchr` instead of `std::count` made it run in 2.3s instead of 8.4s (**over 3x faster**). Next, using a `read` loop on the `fd` turned out to be marginally faster than using the `mmap`. I show my adapted `wc()` version [here](http://coliru.stacked-crooked.com/view?id=b999f3982d53df0176fd94283924487b-542192d2d8aca3c820c7acc656fa0c68) – sehe Jul 29 '13 at 20:18
  • Does calling `madvise(addr, 0, MADV_SEQUENTIAL)` after your call to `mmap()` help with performance? That would at least make it more comparable to the `wc()` implementation, which uses `posix_fadvise()`. – Void Jul 29 '13 at 20:31
  • @Void nope, no visible improvements. Thanks for pointing out `madvise` exists too :) – sehe Jul 29 '13 at 21:23
  • Reading in 16 kiB chunks reuses the same 4 pages of address space in your process. You won't have TLB misses, and 16 kiB is smaller than L1 cache. The memcpy from page cache (inside `read(2)`) goes very fast, and the `memchr` only touches memory that's hot in L1. The mmap version has to fault each page, because `mmap` doesn't wire all the pages (unless you use MAP_POPULATE, but that won't work well when the file size is a large fraction of RAM size). – Peter Cordes Apr 24 '16 at 08:07
10

4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD you're likely getting ~100 MB/s sequential read. That's 16 seconds just in I/O.

Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read them line by line? does the system have enough RAM to read the whole file into a large buffer and then parse it?), there is little specific advice I can offer to speed up the process.

Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for new lines rather than using "getline" would offer an improvement.
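
For instance, here is a minimal sketch of that chunked approach (the 1 MiB chunk size and the file name are placeholders of mine, not something benchmarked against the OP's data):

#include <cstdint>
#include <cstring>   // memchr
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);  // placeholder file name
    if (!in) { std::cerr << "open failed\n"; return 1; }

    std::vector<char> buf(1 << 20);                   // 1 MiB chunk, arbitrary
    uintmax_t lines = 0;

    // read large blocks and scan them for '\n' instead of calling getline
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
    {
        const char* p   = buf.data();
        const char* end = p + in.gcount();
        while ((p = static_cast<const char*>(memchr(p, '\n', end - p))))
            ++lines, ++p;
    }
    std::cout << lines << " lines\n";
}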

EDIT: After doing some learning (thanks @sehe), here's the memory-mapped solution I would likely use.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main() {
    const char* fName = "big.txt";
    //
    struct stat sb;
    long cntr = 0;
    int fd, lineLen;
    char *data;
    char *line;
    // map the file
    fd = open(fName, O_RDONLY);
    fstat(fd, &sb);
    //// int pageSize;
    //// pageSize = getpagesize();
    //// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
    data = (char*)mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0); // cast needed when compiling as C++
    line = data;
    // get lines
    while(cntr < sb.st_size) {
        lineLen = 0;
        line = data;
        // find the end of the current line
        while(*data != '\n' && cntr < sb.st_size) {
            data++;
            cntr++;
            lineLen++;
        }
        // step over the '\n' so the next iteration starts on the next line
        if(cntr < sb.st_size) {
            data++;
            cntr++;
        }
        /***** PROCESS LINE *****/
        // ... processLine(line, lineLen);
    }
    return 0;
}
Louis Ricci
  • +1 for beer coaster calculations. An SSD could reach ~500 MB/s though. Memory mapping could be more efficient depending on the usage scenario – sehe Jul 29 '13 at 13:23
  • I need to read it line by line, because they don't contain a header which tells me how long they are. I could put them into a RAM buffer because I can discard each one after reading it, but then again, I thought that was what ifstream did. Is there a way to tell a program to just throw the whole thing into RAM? – Arne Jul 29 '13 at 13:32
  • @sehe - I was always under the impression that memory mapping files was more of a convenience abstraction for overlapping I/O than a performance boost, especially for a sequential read task. My guess is the OP is using "getline", which reads 1 byte at a time looking for \n and causes a lot of unnecessarily small file reads. Using a larger read buffer in a sequential ifstream would offer the exact same performance as a mapped file (but I am very open to being proven wrong). – Louis Ricci Jul 29 '13 at 13:33
  • @ArneRecknagel - if you have enough RAM to handle it, you can get the file size, allocate a buffer large enough and do one read operation into the buffer. This will of course have the hefty delay I mentioned. A better way would probably be to allocate a ~16 MB buffer, read into it, parse the lines you can, move the last (possibly incomplete) line to the beginning of the buffer and continue your read loop into the rest of it. – Louis Ricci Jul 29 '13 at 13:37
  • @ArneRecknagel - the underlying caching and abstraction of a mapped file would make the task I described in my last comment a bit easier, but probably not any faster. – Louis Ricci Jul 29 '13 at 13:38
  • @LastCoder mmaps are a convenience too, but they also avoid paging in the pages you don't access, work in binary mode implicitly, and only require _virtual_ address space (as opposed to copying the data into a local buffer). Some filesystem drivers may even have zero-copy paths, especially on read-only maps – sehe Jul 29 '13 at 13:42
  • @sehe - Thanks sehe, zero copy gave me something to look into. It seems that for sequential reads mmap offers an order of magnitude improvement. My previous bias was from work on large file encryption in the past, where toggling between an optimal amount of reads and writes was an issue. – Louis Ricci Jul 29 '13 at 13:58
10

Neil Kirk, unfortunately I can not reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance, reading a text file line by line, is exactly the same.

std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}

This takes 1426ms on a 106MB file.

std::ifstream stream;
std::string line;
while(stream.good()) {
    getline(stream, line);
}

This takes 1433ms on the same file.

The following code, on the other hand, is faster:

const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}

This takes 884 ms on the same file. It is just a little tricky since you have to set the maximum size of your buffer (i.e. the maximum length of a line in the input file).
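
For completeness, here is how I would assemble that last variant into a self-contained program (the file name is a placeholder, and I count every line instead of stopping at the first empty one):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const int MAX_LENGTH = 524288;              // upper bound on the line length
    std::ifstream iStream("input.txt");         // placeholder file name
    std::vector<char> line(MAX_LENGTH);         // reused buffer, no per-line allocation

    uintmax_t numLines = 0;
    while (iStream.getline(line.data(), MAX_LENGTH))
        ++numLines;

    std::cout << numLines << " lines\n";
}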

user2434119
3

Do you have to read all the files at the same time (at the start of your application, for example)?

If you do, consider parallelizing the operation.

Either way, consider using binary streams, or unbuffered reads of blocks of data.
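
A rough sketch of what the parallel version could look like (file name, chunk size and thread count are illustrative; as the comments below note, this only pays off if the storage can actually serve concurrent reads):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// Count '\n' bytes in the byte range [begin, begin + length) of the file.
static void count_newlines(const char* path, std::streamoff begin,
                           std::streamoff length, uintmax_t& result)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(begin);
    std::vector<char> buf(1 << 20);                        // 1 MiB per read, arbitrary
    uintmax_t lines = 0;
    while (length > 0 && in)
    {
        std::streamoff chunk = std::min<std::streamoff>(length, buf.size());
        in.read(buf.data(), chunk);
        std::streamsize got = in.gcount();
        if (got <= 0) break;
        const char* p   = buf.data();
        const char* end = p + got;
        while ((p = static_cast<const char*>(memchr(p, '\n', end - p))))
            ++lines, ++p;
        length -= got;
    }
    result = lines;
}

int main()
{
    const char* path = "input.txt";                        // placeholder file name
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    std::streamoff size = probe.tellg();

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<uintmax_t> counts(n, 0);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)                       // split the file into n byte ranges
    {
        std::streamoff begin = size * i / n;
        std::streamoff end   = size * (i + 1) / n;
        workers.emplace_back(count_newlines, path, begin, end - begin, std::ref(counts[i]));
    }
    for (auto& t : workers) t.join();

    uintmax_t total = 0;
    for (uintmax_t c : counts) total += c;
    std::cout << "lines: " << total << "\n";
}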

utnapistim
  • Parallelizing on an HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On an SSD it might (!) improve things. – ogni42 Jul 29 '13 at 13:49
  • You are probably right (I hadn't considered that a single HDD could cause further delays). If the OP combines this with unbuffered reads (say, moving rdbuf() into a separate ostringstream and reading from there) it may still be faster. I guess once the OP decides on an implementation, he (she?) will have to measure and find out. – utnapistim Jul 29 '13 at 13:57
3

As someone with a little background in competitive programming, I can tell you: at least for simple things like integer parsing, the main cost in C is locking the file streams (which is done by default to support multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false), but I don't know if it's as fast as unlocked_stdio.
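
(On the C++ side, here is a minimal sketch of what that looks like for the OP's line-counting case; I have not benchmarked it against unlocked_stdio:)

#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::ios::sync_with_stdio(false);   // drop the C-stdio synchronisation
    std::cin.tie(nullptr);              // don't flush cout before every read

    uintmax_t lines = 0;
    std::string line;
    while (std::getline(std::cin, line))
        ++lines;
    std::cout << lines << "\n";
}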

For reference here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.

int readint(void)
{
        int n, c;
        n = getchar_unlocked() - '0';
        while ((c = getchar_unlocked()) > ' ')
                n = 10*n + c-'0';
        return n;
}

(Note: This one only works if there is precisely one non-digit character between any two integers).

And of course avoid memory allocation if possible...

Jo So
1

Use random file access or binary mode. For sequential reading this can make a big difference, but it still depends on what you are reading.
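
(A binary-mode stream, for what it's worth, is just an extra flag on the constructor; this is a sketch with a placeholder file name, not something I have measured:)

#include <fstream>
#include <string>

int main()
{
    std::ifstream txtFile("input.txt", std::ios::in | std::ios::binary);  // placeholder file name
    std::string line;
    while (std::getline(txtFile, line))
    {
        // process line
    }
}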

Shumail