Okay, from the comments it seems you want a solution that's as fast as possible.
Here's what I would do to achieve something close to that requirement.
While you could probably get a memory-pool allocator to allocate your strings, the STL is not my strong point, so I'm going to do it by hand. Beware: this is not necessarily the C++ way to do it, so C++-heads might cringe a little. Sometimes you just have to do this when you want something a little specialised.
So, your data file is about 10 GB... Allocating that in a single block is a bad idea; most likely your OS will refuse. But it's fine to break it up into a whole bunch of pretty big blocks. Maybe there's a magic number here, but let's say roughly 64 MB. People who are paging experts could comment here? I remember reading once that it's good to use a little less than an exact page-size multiple (though I can't recall why), so let's just rip off a few kB:
const size_t blockSize = 64 * 1048576 - 4096;  // ~64 MB, a few kB shy of a page multiple
Now, how about a structure to track your memory? May as well make it a list so you can throw them all together.
struct SBlock {
    SBlock *next;
    size_t  length;  // Number of valid bytes in data
    char   *data;    // Some APIs use data[1] so you can use the first element,
                     // but that's a hack that might not work on all compilers.
};
Right, so you need to allocate a block - you'll allocate a large chunk of memory and use the first little bit to store some information. Note that you can change the data pointer if you need to align your memory:
SBlock * NewBlock( size_t blockSize, SBlock *prev = NULL )
{
    // Header and data area are allocated as one chunk.
    SBlock *b = (SBlock*)new char[ sizeof(SBlock) + blockSize ];
    if( prev != NULL ) prev->next = b;
    b->next = NULL;
    b->data = (char*)(b + 1);  // First byte following the struct
    b->length = blockSize;
    return b;
}
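Since each block came from a single new char[], freeing the whole list later is just a walk down the chain. A quick sketch (FreeBlocks is my own name for it, nothing standard):

void FreeBlocks( SBlock *b )
{
    while( b ) {
        SBlock *next = b->next;
        delete [] (char*)b;  // Must match the new char[] in NewBlock
        b = next;
    }
}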
Now you're gonna read...
FILE *infile = fopen( "mydata.csv", "rb" );  // Told you C++ers would hate me
if( infile == NULL ) return 1;  // Or however you prefer to bail out

SBlock *blocks = NULL;
SBlock *block = NULL;
size_t spilloverBytes = 0;

for( ;; ) {  // Don't test feof() here: a partial final read sets EOF before
             // the last block gets trimmed. Rely on the nBytes == 0 break.
    // Allocate a new block. If there was spillover, a new block is already
    // waiting (created at the bottom of the loop) so don't do anything.
    if( spilloverBytes == 0 ) block = NewBlock( blockSize, block );
    // Set the list head.
    if( blocks == NULL ) blocks = block;

    // Read a block of data, filling in after any spillover.
    size_t nBytesReq = block->length - spilloverBytes;
    char *front = block->data + spilloverBytes;
    size_t nBytes = fread( (void*)front, 1, nBytesReq, infile );
    if( nBytes == 0 ) {
        block->length = spilloverBytes;  // Trim the final block
        break;
    }

    // Search backwards for a newline and treat all characters after that
    // newline as spillover -- they will be copied into the next block.
    char *back = front + nBytes - 1;
    while( back > front && *back != '\n' ) back--;
    back++;

    spilloverBytes = (front + nBytes) - back;  // Bytes past the last newline
    block->length = back - block->data;        // Shrink to whole lines only

    // Transfer the spillover to a fresh block.
    if( spilloverBytes > 0 ) {
        block = NewBlock( blockSize, block );
        memcpy( block->data, back, spilloverBytes );
    }
}
fclose(infile);
Okay, something like that. You get the gist. Note that at this point you've probably read the file considerably faster than with repeated calls to std::getline. You can get faster still if you can disable any caching. On Windows you can use the CreateFile API and tweak it for really fast reads: FILE_FLAG_SEQUENTIAL_SCAN hints at your access pattern, and FILE_FLAG_NO_BUFFERING bypasses the OS cache entirely (but then your buffers, read sizes and file offsets must be sector-aligned). Hence my earlier comment about potentially aligning your data blocks to the disk sector size. Not sure about Linux or other OSes.
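For what it's worth, a minimal sketch of the Windows open-and-read path; treat it as an illustration of the flags rather than drop-in code (it reuses front and nBytesReq from the loop above):

#include <windows.h>

HANDLE h = CreateFileA( "mydata.csv", GENERIC_READ, FILE_SHARE_READ, NULL,
                        OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL );
if( h != INVALID_HANDLE_VALUE ) {
    DWORD nRead = 0;
    // Same role as the fread above; with FILE_FLAG_NO_BUFFERING you would
    // also need front and nBytesReq to be sector-size multiples.
    if( ReadFile( h, front, (DWORD)nBytesReq, &nRead, NULL ) ) {
        // ... same spillover logic as before ...
    }
    CloseHandle( h );
}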
So, this is a kind of complicated way to slurp an entire file into memory, but it's simple enough to be accessible and moderately flexible. Hopefully I didn't make too many errors. Now you just want to go through your list of blocks and start indexing them.
I'm not going to go into huge detail here, but the general idea is this. You tokenise in-place by writing NUL ('\0') terminators over the separators at the appropriate places, and keeping track of where each token began.
std::vector<char*> tokens;                // Needs <vector>. You can do
std::vector< std::vector<char*> > lines;  // better than these -- see below.

SBlock *block = blocks;
while( block ) {
    char *c = block->data;
    char *back = c + block->length;
    char *token = NULL;

    // Find the first token: skip any leading separators.
    while( c != back ) {
        if( *c != '"' && *c != '*' && *c != '\n' ) break;
        c++;
    }
    token = c;

    // Tokenise the entire block.
    while( c != back ) {
        switch( *c ) {
        case '"':
            // For speed, we assume every closing quote has an opening quote.
            // A closing quote without an opening quote won't be handled.
            if( token == c ) token++;  // Opening quote: token starts after it
            else *c = 0;               // Closing quote: terminate token here
            break;
        case '*':
            // Record separator
            *c = 0;
            tokens.push_back( token );  // You can do better than this...
            token = c + 1;
            break;
        case '\n':
            // Record and line separator
            *c = 0;
            tokens.push_back( token );  // You can do better than this...
            lines.push_back( tokens );  // ... and WAY better than this...
            tokens.clear();             // Arrrgh!
            token = c + 1;
            break;
        }
        c++;
    }

    // Next block.
    block = block->next;
}
Finally, you'll see those vector calls above. Now, again, if you can memory-pool your vectors that's great and easy. But once again, I just never do it, because I find it a lot more intuitive to work directly with memory. You can do something similar to what I did with the file chunks: create chunks of memory for arrays (or lists), add all your tokens (which are just 8-byte pointers on a 64-bit build) to the current chunk, and add new chunks as required.
You might even make a little header that keeps track of how many items are in one of these token arrays. The key is never to compute something up front that you can get later for free (e.g. an array's size: you only need to calculate it after you've added the last element).
You do the same again with lines. All you need is a pointer to the relevant part of a tokens chunk (and you have to do the spillover thing if a line eats into a new chunk, if you want array indexing).
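A minimal sketch of the idea for the token pool; STokenChunk, AddToken and tokensPerChunk are names and numbers I've made up for illustration:

const size_t tokensPerChunk = 65536;

struct STokenChunk {
    STokenChunk *next;
    size_t count;                  // How many slots are used so far
    char *tokens[tokensPerChunk];  // Pointers straight into the file blocks
};

STokenChunk * NewTokenChunk( STokenChunk *prev = NULL )
{
    STokenChunk *tc = new STokenChunk;
    tc->next = NULL;
    tc->count = 0;
    if( prev != NULL ) prev->next = tc;
    return tc;
}

// Append a token pointer, growing the pool a chunk at a time.
STokenChunk * AddToken( STokenChunk *tc, char *token )
{
    if( tc->count == tokensPerChunk ) tc = NewTokenChunk( tc );
    tc->tokens[tc->count++] = token;
    return tc;  // Caller keeps hold of the current tail
}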
What you'll end up with is an array of lines which point to arrays of tokens, which point directly into the memory you slurped out of the file. And while there's a bit of memory wastage, it's probably not excessive. It's the price you pay for making your code fast.
I'm sure it could all be wrapped up beautifully in a few simple classes, but I've given it to you raw here. Even if you memory-pooled a bunch of STL containers, I expect the overhead of those allocators, along with the containers themselves, would still make it slower than what I've given you. Sorry about the really long answer. I guess I just enjoy this stuff. Have fun, and hope this helps.