
I have a large file (500 million records). The file has two columns (tab-delimited) as follows:

1 4590
3 1390
4 4590
5 4285
7 8902
8 9000
...

All values in the first column are ordered numerically, but with gaps (e.g. 1, then 3, then 4...).

I would like to index that file so I can access the value in column 2 based on the value from column 1 (which I will call the key).

For example, if I submit 8 it should return 9000.

I have started by creating an index as follows:

// Record each entry into a structure
struct Record{
  int gi; //first column
  int taxa; //second column
};

Record buffer;
ofstream BinaryFile("large_file_indexed.bin", ios::binary);
ifstream inputFile("infile.dat");

//Write each record to the binary file
while( inputFile >> buffer.gi >> buffer.taxa ){
    BinaryFile.write( (char *) &buffer, sizeof(Record) );
}
BinaryFile.close();

OK, what I'm doing above is just converting each entry to a fixed-size binary record and saving it to a binary file. This is working as expected.

The problem comes now, and since I'm not an expert I would appreciate your advice. The idea is to read the binary file and get a specific record:

//Read binary file
ifstream ReadBinary("large_file_indexed.bin", ios::binary);
int idx = 8 ; // Which key do we search for?
 while(!ReadBinary.eof())
    {
      ReadBinary.read( (char *) &buffer, sizeof(Record));
      if(idx == buffer.gi) // If we find key return corresponding value
        {
          cout << "Found key " << buffer.gi << " Taxa:" << buffer.taxa <<  endl;
          break;
        }
    }

This returns the expected value. Since we are asking for the value corresponding to key 8, it returns 9000.

The thing is that it still takes too long to get the value, and I was wondering how I can make it faster. With seekg I can jump to a specific position, but I don't know which position corresponds to the key I'm looking for. In other words, can I jump directly to the position where the key is stored and get the corresponding value? I'm confused about how to find the position for a particular key and jump to that position in the binary file. Maybe I should index my input file differently, or am I missing something?

Thanks for your comments.

david
  • Are all the keys sorted? – OneOfOne Mar 14 '16 at 21:48
  • Why are you not using a database? – Ed Heal Mar 14 '16 at 21:50
  • Off topic: `while(!ReadBinary.eof())` isn't what you want. Read more here: http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong – user4581301 Mar 14 '16 at 22:08
  • If the gaps aren't so large that you don't mind them making the file sparse, you can use `seekp` to move the write pointer by `sizeof(struct Record)` * the size of the gap, so that the position of each record in the file corresponds to the column 1 value. – Martin Broadhurst Mar 14 '16 at 22:12 (a sketch of this idea follows these comments)
  • Should be able to get away with a dumb old binary search. Get the size of the file, divide it by `sizeof(struct Record)` to get the number of records, and then start searching. Each index's location in the file will be `index * sizeof(struct Record)`. Seek to that location, read `Record::gi`, and search away. – user4581301 Mar 14 '16 at 22:21
  • @user4581301 I was going to make that suggestion, but I realized that binary search on a file will be relatively slow. The other suggestions of a database or gap filling are likely to be much better. – Mark Ransom Mar 14 '16 at 22:48
  • @MarkRansom: Slow, sure, but not as slow as the linear search david is currently using. – Adrian McCarthy Mar 14 '16 at 22:54
  • @MarkRansom I would add in-memory ranges. First, a memory-mapped file, then ordered pairs of (index, pointer). First search in the ranges table, then binary search between two pointers [start, end). – Severin Pappadeux Mar 14 '16 at 23:03
  • Frankly, some fixed-depth BST for in-memory indices would be pretty good – Severin Pappadeux Mar 14 '16 at 23:08
  • Hold it all in memory. 500 million ints is only 2GB. :) – Roddy Mar 14 '16 at 23:13
  • @MarkRansom Regardless of file system type or speed, binary search changes it from O(n) to O(log n). – Roddy Mar 14 '16 at 23:20
  • @Roddy that's 1 billion ints, two per entry, for 4GB. Still doable if you have a 64-bit system and will be doing many lookups. And of course O(log n) is better than O(n), but the database should get O(log n) too with lower overhead, and the other solution is O(1). – Mark Ransom Mar 15 '16 at 00:12
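A minimal sketch of the gap-filling idea from Martin Broadhurst's comment, assuming the gaps are small enough that a position-indexed file is acceptable. The output file name (large_file_direct.bin), the use of key 0 as "never a real key", and the assumption that the gap bytes read back as zero are illustrative choices, not part of the question:

#include <fstream>
#include <iostream>
using namespace std;

struct Record {
    int gi;   // first column (key)
    int taxa; // second column (value)
};

int main() {
    Record buffer;

    // Build the index: the record for key k is written at offset k * sizeof(Record),
    // so gaps simply become unused slots in the file (zero-filled on most platforms).
    ifstream inputFile("infile.dat");
    ofstream BinaryFile("large_file_direct.bin", ios::binary);  // illustrative file name
    while (inputFile >> buffer.gi >> buffer.taxa) {
        BinaryFile.seekp(streamoff(buffer.gi) * streamoff(sizeof(Record)));
        BinaryFile.write((char*)&buffer, sizeof(Record));
    }
    BinaryFile.close();

    // Lookup: a single seek and read; a zero-filled slot means the key is absent
    // (this assumes real keys are never 0).
    int idx = 8;
    ifstream ReadBinary("large_file_direct.bin", ios::binary);
    if (ReadBinary.seekg(streamoff(idx) * streamoff(sizeof(Record)))
        && ReadBinary.read((char*)&buffer, sizeof(Record))
        && buffer.gi == idx) {
        cout << "Found key " << buffer.gi << " Taxa:" << buffer.taxa << endl;
    } else {
        cout << "Key " << idx << " not found" << endl;
    }
    return 0;
}

Lookup then costs a single seek and read, at the price of storing one record slot per possible key value.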

1 Answer


If you can't use a database or a b-tree library, and don't want to invest in developing yet another b-tree library, you could consider one of the two following approaches.

Both assume that the binary index file is sorted by key, and take advantage of the fixed-size records.

1. Simple heuristic approach

If there were no gaps, to find the n-th record (numbering starting at one) you would do:

if (ReadBinary.seekg(sizeof(Record)*(n-1))
     && ReadBinary.read( (char*)&buffer, sizeof(Record))) {
     // process record 
}
else {
    // record not found (certainly beyond eof)
}

But you can have gaps. This means that, if there are no duplicates, the element with key n will be at this position or before it. So just read and rewind as long as necessary:

if (! ReadBinary.seekg(sizeof(Record)*(n-1))) {   // try to position on record n
    ReadBinary.clear();                           // if we couldn't position,
    ReadBinary.seekg(-streamoff(sizeof(Record)), ios_base::end);  // go to the last record
}
while (ReadBinary.read( (char*)&buffer, sizeof(Record)) && buffer.gi>n ) {
    // key read is still too big: step back one record (two records from the current position)
    ReadBinary.seekg(-2*streamoff(sizeof(Record)), ios_base::cur);
}
if (ReadBinary && buffer.gi==n) {
        // record found
}
else {
    // record not found
}

2. Dichotomic approach

Of course, if you have many gaps, this heuristic approach will quickly become too slow as the key searched for increases.

You could therefore opt for a dichotomic search (aka binary search): with seekg() go to the end of the file and use tellg() to get the size of the file, which you can translate into a number of records.

Cut that range in two, position on the record in the middle, read it, check whether the searched key is smaller or bigger than the key read, and restart with the new bounds of the search until you find the right position. It is the same principle you would use to search in a sorted array.

This is very efficient, as you need at most log(n)/log(2) reads to find any key. So for any of the 500 000 000 records, you'd need at most 29 reads!
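As a rough sketch, assuming the file layout from the question (the helper name findRecord and the error handling are my own choices, not the only way to write it):

#include <fstream>
#include <iostream>
using namespace std;

struct Record {
    int gi;   // key (first column)
    int taxa; // value (second column)
};

// Dichotomic (binary) search over the sorted, fixed-size-record binary file.
// Returns true and fills 'out' when the key is found.
bool findRecord(ifstream& in, int key, Record& out) {
    in.seekg(0, ios_base::end);
    streamoff fileSize = in.tellg();
    streamoff count = fileSize / streamoff(sizeof(Record));  // number of records
    streamoff lo = 0, hi = count - 1;
    while (lo <= hi) {
        streamoff mid = lo + (hi - lo) / 2;
        in.seekg(mid * streamoff(sizeof(Record)));            // jump to the middle record
        if (!in.read((char*)&out, sizeof(Record)))
            return false;                       // I/O error: give up
        if (out.gi == key)     return true;     // found it
        else if (out.gi < key) lo = mid + 1;    // search the upper half
        else                   hi = mid - 1;    // search the lower half
    }
    return false;                               // key not present
}

int main() {
    ifstream ReadBinary("large_file_indexed.bin", ios::binary);
    Record r;
    if (findRecord(ReadBinary, 8, r))
        cout << "Found key " << r.gi << " Taxa:" << r.taxa << endl;
    else
        cout << "Key not found" << endl;
    return 0;
}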

3. Conclusions

Of course there are other feasible approaches as well. But in the end, this is already pretty good, even if it would be outperformed by any database or a well-crafted b-tree library, because b-trees reduce disk head movement by astutely regrouping nodes into blocks that are optimized to be read at once with minimal disk overhead. This reduces the number of disk accesses to log(n)/log(b), where b is the number of nodes in a block. For example, with b=10, searching the 500 000 000 elements would require at most 9 reads from disk.

Christophe