3

I am trying to read 200,000 records from a file and then use a tokenizer to parse each string and remove the quotes around each part. But the running time is very high compared to simply reading the strings: it took 25 seconds just to read these records (about 0.0001 seconds per record!). Is there a problem with my code, or, if not, is there a faster way to do this?

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

int main()
{
    int counter = 0;
    std::string getcontent;
    std::vector<std::string> line;
    std::vector< std::vector<std::string> > lines;

    boost::escaped_list_separator<char> sep( '\\', '*', '"' ) ;
    boost::tokenizer<> tok(getcontent);

    std::ifstream openfile ("test.txt");

    if(openfile.is_open())
    {
        while(!openfile.eof())
        {
            getline(openfile,getcontent);

            // THIS LINE TAKES A LOT OF TIME
            boost::tokenizer<> tok(getcontent); 

            for (boost::tokenizer<>::iterator beg=tok.begin(); beg!=tok.end(); ++beg){
                line.push_back(*beg);
            }

            lines.push_back(line);
            line.clear();
            counter++;
        }
        openfile.close();
    }
    else std::cout << "No such file" << std::endl;

    return 0;
}
POD
  • So you're just trying to read a file without the quote marks? – Jerry Coffin Sep 10 '12 at 02:30
  • 3
    Though irrelevant, `while (!openfile.eof())` is the wrong way to say it. Use `while (getline(openfile, getcontent))`. – chris Sep 10 '12 at 02:30
  • 4
    Please don't loop while not eof. That won't have the desired behaviour. If `!openfile.eof()` is true, that's not a guarantee that a read will succeed. See http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong – R. Martinho Fernandes Sep 10 '12 at 02:31
  • All I want to do is read lots of records and extract their parts. Some of them have quotes around them, but there aren't any special characters inside the quotes. – POD Sep 10 '12 at 02:33
  • @POD: If you just want to get rid of quotes, you can do that pretty easily without using `boost::tokenizer` at all. You still haven't made it clear whether that's sufficient to your needs or not though. – Jerry Coffin Sep 10 '12 at 02:35
  • Yeah, that is all I need. My records are like: 837478738*"EP"*10*"3FB2B464BD5003B55CA6065E8E040A2A"*"F"*21*15*"NH"*"N"*0**-1*"-1"*0*0**-1*196482*-1*"23"*1*"-1"*"-1"*"78903"*"V1301"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"-1"*"0940"*"-1"*"-1"*"-1"*"-1"*1*1*35.46*31.2*0*0*0*0*"MC" I have to use the * as a delimiter character and " as a quote character. There are no special character inside those quotes, but there could be empty parts that only have ** with nothing inside them – POD Sep 10 '12 at 02:37
  • At what part in your code are you passing the separators into the tokenizer? Why do you have an unused tokenizer declared outside the loop? Aren't you supposed to do it [like this](http://stackoverflow.com/a/55680/1553090)? – paddy Sep 10 '12 at 02:44
  • @JerryCoffin Do you have a faster suggestion? – POD Sep 10 '12 at 02:53
  • @POD: Working on an answer right now. I'm not sure it'll be faster, but at least worth testing, I think. – Jerry Coffin Sep 10 '12 at 02:54
  • @R.MartinhoFernandes Would that change the time to read the records? – POD Sep 10 '12 at 03:00
  • The fastest solution will be to slurp the entire file into memory, tokenise it yourself and index the lines/tokens. – paddy Sep 10 '12 at 03:05
  • @paddy how can I read the entire file into the memory before tokenizing it? – POD Sep 10 '12 at 03:16
  • @POD: http://stackoverflow.com/a/3304059/179910 – Jerry Coffin Sep 10 '12 at 03:28

3 Answers

4

At least if I'm reading this correctly, I think I'd take a rather more C-like approach. Instead of reading a line, then breaking it up into tokens, and stripping out the characters you don't want, I'd read a character at a time, and based on the character I read, decide whether to add it to the current token, end the token and add it to the current line, or end the line and add it to the vector of lines:

#include <vector>
#include <string>
#include <stdio.h>
#include <time.h>

std::vector<std::vector<std::string> > read_tokens(char const *filename) {
    std::vector<std::vector<std::string> > lines;
    FILE *infile= fopen(filename, "r");

    int ch;

    std::vector<std::string> line;
    std::string token;

    while (EOF != (ch = getc(infile))) {
        switch(ch) {
            case '\n':                // end of line: finish the last field, store the line
                line.push_back(token);
                token.clear();
                lines.push_back(line);
                line.clear();
                break;
            case '"':
                break;
            case '*':
                line.push_back(token);
                token.clear();
                break;
            default:
                token.push_back(ch);
        }
    }

    if (!line.empty() || !token.empty()) {    // file didn't end with a newline
        line.push_back(token);
        lines.push_back(line);
    }
    fclose(infile);
    return lines;
}

int main() {
    clock_t start = clock();
    std::vector<std::vector<std::string> > lines = read_tokens("sample_tokens.txt");
    clock_t finish = clock();
    printf("%f seconds\n", double(finish-start)/CLOCKS_PER_SEC);
    return 0;
}

Doing a quick test with this on a file with a little over 200K copies of the sample you gave in the comment, it reads and (apparently) tokenizes the data in ~3.5 seconds with gcc or ~4.5 seconds with VC++. I'd be a little surprised to see anything get a whole lot faster (at least without faster hardware).

As an aside, this handles memory about the same way you originally did, which (at least in my opinion) is pretty strong evidence that managing memory in the vector probably isn't a major bottleneck.

Jerry Coffin
  • @JerryCoffin Paddy also mentioned reading the entire file into memory first. How would that affect the speed? Does that make a huge difference? – POD Sep 10 '12 at 03:30
  • @pod That wouldn't be much use if you're turning it into strings anyway. I was talking about reading the whole lot, tokenizing it in-place, and indexing into it (so your strings would be char*). I'd only do that if I was really concerned about speed. – paddy Sep 10 '12 at 03:33
  • @pod And will that data file fit in RAM? How big is it? Do you need to store the whole thing in memory at once? – paddy Sep 10 '12 at 03:42
  • Reading it all into memory, and tokenizing from there, would probably be worthwhile for 100 million records. If you're on Linux, it's probably still better to use `mmap` instead of reading it directly. To give an idea of what you could hope for, consider that the other code I linked to for reading the file into a string reads the current test file in 0.3 seconds. I would *not*, however, use raw `char *` for the tokens though -- I'd encapsulate it in a string_ref class that holds a char * and either the length, or another char * pointing to the end of the "string". – Jerry Coffin Sep 10 '12 at 03:43
  • @paddy These 200,000 records are just a very small sample. My original file has more than 100 million records. I think speed is important here. – POD Sep 10 '12 at 03:44
  • @paddy I can get enough RAM for that. It is around 10 gigabytes. – POD Sep 10 '12 at 03:47
2

Instead of `boost::tokenizer<> tok(getcontent);`, which constructs a new `boost::tokenizer` on every call to `getline`, use the `assign` member function:

boost::escaped_list_separator<char> sep( '\\', '*', '"' ) ;
boost::tokenizer<boost::escaped_list_separator<char>> tok(getcontent, sep);

// Other code
while(getline(openfile,getcontent))
{
    tok.assign(getcontent.begin(), getcontent.end()); // Use assign here
    line.assign(tok.begin(), tok.end()); // Instead of for-loop
    lines.push_back(line);
    counter++;
}

See if that helps. Also, try allocating the vector memory beforehand if possible.
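
For example, if you know roughly how many records the file contains (200,000 here, going by the question), reserving the outer vector up front avoids repeated reallocations as it grows. A minimal sketch, not measured:

std::vector<std::vector<std::string> > lines;
lines.reserve(200000);   // roughly one entry per record, per the question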

Jesse Good
  • @POD: Sorry, I was using a C++11 language feature(`auto`), I changed the code back to what it would look like in C++03. – Jesse Good Sep 10 '12 at 03:14
  • @POD: I changed the `assign` to use `tok.assign(getcontent.begin(), getcontent.end());` instead because the `sep` never changes. – Jesse Good Sep 10 '12 at 03:22
  • @POD: I revised the code again to get rid of the for loop and `clear`. – Jesse Good Sep 10 '12 at 03:54
  • The code is very clean, but the performance is not different from mine. It took around 50 seconds to run. – POD Sep 10 '12 at 04:09
  • @POD: Yes, I'm not surprised. [See this link](http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer), they have benchmarks between `boost::tokenizer` and `StrTk`, usually boost::tokenizer is up to 3 times slower! – Jesse Good Sep 10 '12 at 04:17
2

Okay, from the comments it seems you want a solution that's as fast as possible.

Here's what I would do to achieve something close to that requirement.

While you could probably get a memory-pool allocator to allocate your strings, STL is not my strong point so I'm going to do it by hand. Beware this is not necessarily the C++ way to do it. So C++-heads might cringe a little. Sometimes you just have to do this when you want something a little specialised.

So, your data file is about 10 GB... Allocating that in a single block is a bad idea. Most likely your OS will refuse. But it's fine to break it up into a whole bunch of pretty big blocks. Maybe there's a magic number here, but let's say around 64 MB. People who are paging experts could comment here? I remember reading once that it's good to use a little less than an exact page-size multiple (though I can't recall why), so let's just rip off a few kB:

const size_t blockSize = 64 * 1048576 - 4096;

Now, how about a structure to track your memory? May as well make it a list so you can throw them all together.

struct SBlock {
    SBlock *next;
    char   *data;      // Points just past this header, into the same allocation.
                       // (Some APIs use a data[1] flexible-array hack instead, but
                       // that doesn't work reliably on all compilers.)
    size_t  length;    // Number of valid bytes in data
};

Right, so you need to allocate a block - you'll allocate a large chunk of memory and use the first little bit to store some information. Note that you can change the data pointer if you need to align your memory:

SBlock * NewBlock( size_t blockSize, SBlock *prev = NULL )
{
    SBlock * b = (SBlock*)new char [sizeof(SBlock) + blockSize];
    if( prev != NULL ) prev->next = b;
    b->next = NULL;
    b->data = (char*)(b + 1);           // First char following the struct
    b->length = blockSize;
    return b;
}

Now you're gonna read...

FILE *infile = fopen( "mydata.csv", "rb" );  // Told you C++ers would hate me

SBlock *blocks = NULL;
SBlock *block = NULL;
size_t spilloverBytes = 0;

while( !feof(infile) ) {
    // Allocate new block.  If there was spillover, a new block will already
    // be waiting so don't do anything.
    if( spilloverBytes == 0 ) block = NewBlock( blockSize, block );

    // Set list head.
    if( blocks == NULL ) blocks = block;

    // Read a block of data
    size_t nBytesReq = block->length - spilloverBytes;
    char* front = block->data + spilloverBytes;
    size_t nBytes = fread( (void*)front, 1, nBytesReq, infile );
    if( nBytes == 0 ) {
        block->length = spilloverBytes;
        break;
    }

    // Search backwards for a newline and treat all characters after that newline
    // as spillover -- they will be copied into the next block.
    char *back = front + nBytes - 1;
    while( back > front && *back != '\n' ) back--;
    back++;

    spilloverBytes = (front + nBytes) - back;   // Bytes after the last newline
    block->length = back - block->data;

    // Transfer that data to a new block and resize current block.
    if( spilloverBytes > 0 ) {
        block = NewBlock( blockSize, block );
        memcpy( block->data, back, spilloverBytes );
    }
}

fclose(infile);

Okay, something like that. You get the gist. Note that at this point, you've probably read the file considerably faster than with multiple calls to std::getline. You can get faster still if you can disable any caching. In Windows you can use the CreateFile API and tweak it for really fast reads. Hence my earlier comment about potentially aligning your data blocks (to the disk sector size). Not sure about Linux or other OSes.
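
Roughly, on Windows that looks something like the sketch below (untested, purely illustrative; note that with FILE_FLAG_NO_BUFFERING both the buffer address and the read size have to be sector-aligned, which is one more reason to pick block sizes like the one above):

#include <windows.h>
#include <malloc.h>

// Open for unbuffered, sequential reads (error handling omitted).
HANDLE h = CreateFileA( "mydata.csv", GENERIC_READ, FILE_SHARE_READ, NULL,
                        OPEN_EXISTING,
                        FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL );

// The buffer must be sector-aligned, so don't use plain new/malloc here.
char *buf = (char*)_aligned_malloc( blockSize, 4096 );

DWORD got = 0;
ReadFile( h, buf, (DWORD)blockSize, &got, NULL );
// ... feed 'got' bytes into the same spillover logic as the fread loop above ...

_aligned_free( buf );
CloseHandle( h );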

So, this is a kind of complicated way to slurp an entire file into memory, but it's simple enough to be accessible and moderately flexible. Hopefully I didn't make too many errors. Now you just want to go through your list of blocks and start indexing them.

I'm not going to go into huge detail here, but the general idea is this. You tokenise in-place by blitzing NULL values at the appropriate places, and keeping track of where each token began.

// Containers used for illustration in this snippet -- the paragraphs below
// describe how to replace these with hand-rolled pooled arrays.
std::vector<char*> tokens;
std::vector<std::vector<char*> > lines;

SBlock *block = blocks;

while( block ) {
    char *c = block->data;
    char *back = c + block->length;
    char *token = NULL;

    // Find first token
    while( c != back ) {
        if( c != '"' && c != '*' ** c != '\n' ) break;
        c++;
    }
    token = c;

    // Tokenise entire block
    while( c != back ) {
        switch( *c ) {
            case '"':
                // For speed, we assume all closing quotes have opening quotes.  If
                // we have closing quote without opening quote, this won't be correct
                if( token != c) {
                    *c = 0;
                    token++;
                }
                break;

            case '*':
                // Record separator
                *c = 0;
                tokens.push_back(token);  // You can do better than this...
                token = c + 1;
                break;

            case '\n':
                // Record and line separator
                *c = 0;
                tokens.push_back(token);  // You can do better than this...
                lines.push_back(tokens);  // ... and WAY better than this...
                tokens.clear();           // Arrrgh!
                token = c + 1;
                break;
        }

        c++;
    }

    // Next block.
    block = block->next;
}

Finally, you'll see those vector-like calls above. Now, again if you can memory-pool your vectors that's great and easy. But once again, I just never do it because I find it a lot more intuitive to just work directly with memory. You can do something similar to what I did with the file chunks but create memory for arrays (or lists). You add all your tokens (which are just 8-byte pointers) to this memory area and add new chunks of memory as required.

You might even make a little header that keeps track of how many items are in one of these token arrays. The key is never to calculate something up front that you can calculate later for no extra cost (i.e. an array size -- you only need to work that out after you've added the last element).

You do the same again with lines. All you need is a pointer to the relevant part of a tokens chunk (and you have to do the same spillover handling as with the file blocks if a line's tokens run into a new chunk, assuming you want array indexing).
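
A minimal sketch of that idea, in the same style as the file blocks (the names and the capacity handling here are purely illustrative):

struct SPtrChunk {
    SPtrChunk  *next;
    size_t      count;      // how many pointers are stored so far
    size_t      capacity;   // how many pointers fit in this chunk
    char      **ptrs;       // points just past this header
};

SPtrChunk * NewPtrChunk( size_t capacity, SPtrChunk *prev = NULL )
{
    SPtrChunk *p = (SPtrChunk*)new char [sizeof(SPtrChunk) + capacity * sizeof(char*)];
    p->next = NULL;
    p->count = 0;
    p->capacity = capacity;
    p->ptrs = (char**)(p + 1);       // First pointer slot following the header
    if( prev != NULL ) prev->next = p;
    return p;
}

// In place of tokens.push_back(token):
//     if( chunk->count == chunk->capacity ) chunk = NewPtrChunk( chunk->capacity, chunk );
//     chunk->ptrs[chunk->count++] = token;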

What you'll end up with is an array of lines which point to arrays of tokens, which point directly into the memory you slurped out of the file. And while there's a bit of memory wastage, it's probably not excessive. It's the price you pay for making your code fast.

I'm sure it could all be wrapped up beautifully in a few simple classes, but I've given it to you raw here. Even if you memory-pooled a bunch of STL containers, I expect the overhead of those allocators along with the containers themselves would still make it slower than what I've given you. Sorry about the really long answer. I guess I just enjoy this stuff. Have fun, and hope this helps.

paddy
  • @paddy This is a great and thorough explanation of memory allocation. – POD Sep 10 '12 at 16:13
  • No worries... By the way, don't underestimate the bottleneck of reading in the data. I've given you fast tokenising and avoided excessive small memory allocation which is a big improvement. But in my tests on Windows, the speed of `fread` was about 60% that of `CreateFile` with buffering/caching turned off. If your disk can deliver 100 MB/s optimized instead of 60 MB/s with buffering/caching, that's a big improvement. – paddy Sep 10 '12 at 21:51