I have a large file (500 million records). The file has two tab-delimited columns, as follows:
1 4590
3 1390
4 4590
5 4285
7 8902
8 9000
...
All values in the first column are sorted numerically, but with gaps (e.g. 1, then 3, then 4...).
I would like to index that file so that I can look up the value in column 2 from a value in column 1 (which I will call the key).
For example, if I submit 8 it should return 9000.
I started by creating an index as follows:
// Record each entry into a structure
struct Record{
    int gi;   // first column (key)
    int taxa; // second column (value)
};
Record buffer;
ofstream BinaryFile("large_file_indexed.bin", ios::binary);
ifstream inputFile("infile.dat");
// Write to binary file; testing the extraction itself avoids writing the last record twice
while( inputFile >> buffer.gi >> buffer.taxa ){
    BinaryFile.write( (char *) &buffer, sizeof(Record) );
}
BinaryFile.close();
OK, what I'm doing above is just creating a binary index file from the entries, stored as fixed-size records. This is working as expected.
The problem comes now, and since I'm not an expert I would appreciate your advice. The idea is to read the binary file and get a specific record:
// Read binary file
ifstream ReadBinary("large_file_indexed.bin", ios::binary);
int idx = 8; // Which key do we search for?
// Loop on the read itself so a failed read at end of file is never processed
while( ReadBinary.read( (char *) &buffer, sizeof(Record) ) )
{
    if(idx == buffer.gi) // If we find the key, print the corresponding value
    {
        cout << "Found key " << buffer.gi << " Taxa:" << buffer.taxa << endl;
        break;
    }
}
This returns the expected value: since we are asking for the value corresponding to key 8, it returns 9000.
The problem is that it still takes too long to get the value, and I was wondering how I can make it faster. With seekg I can jump to a specific position, but I don't know which position corresponds to the key I want. In other words, can I jump directly to the position where the key is and read the corresponding value? I'm confused about how to get the position for a particular key and jump to it in the binary file. Maybe I should index my input file differently, or am I missing something?
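To make the seekg idea concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the binary file contains only whole Record entries sorted by gi, as described above, so record number i starts at byte i * sizeof(Record) and a binary search can replace the linear scan. The function name findTaxa is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <ios>

// Same layout as the Record written to the binary file.
struct Record {
    int gi;    // first column (key)
    int taxa;  // second column (value)
};

// Hypothetical helper: binary search over fixed-size records sorted by gi.
// Returns true and stores the value in `taxa` if `key` is present.
bool findTaxa(std::ifstream& in, int key, int& taxa) {
    in.clear();                          // reset any earlier eof/fail state
    in.seekg(0, std::ios::end);
    const std::streamoff bytes = in.tellg();
    if (bytes < 0) return false;         // stream not usable

    const std::streamoff recSize = static_cast<std::streamoff>(sizeof(Record));
    assert(bytes % recSize == 0);        // assumption: whole records only

    std::streamoff lo = 0;
    std::streamoff hi = bytes / recSize; // search the half-open range [lo, hi)
    Record rec;
    while (lo < hi) {
        const std::streamoff mid = lo + (hi - lo) / 2;
        in.seekg(mid * recSize, std::ios::beg);   // jump straight to record `mid`
        if (!in.read(reinterpret_cast<char*>(&rec), sizeof(Record))) return false;
        if (rec.gi == key) { taxa = rec.taxa; return true; }
        if (rec.gi < key) lo = mid + 1;  // key can only be in the upper half
        else              hi = mid;      // key can only be in the lower half
    }
    return false;                        // key falls in a gap
}
```

With 500 million records this needs roughly 30 seek-and-read steps per lookup instead of a scan through the file. It would be called as findTaxa(ReadBinary, 8, value) on a stream opened with ios::binary.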
Thanks for your comments.