I have been searching the internet but could not find an existing tool for extracting words from a file using a specific delimiter in C++. Does anyone know of an existing library or piece of C++ code that does the job? Here is what I want to achieve:
- Objective: extract words from a file using a delimiter
- Word type: words can be made of any combination of unsigned characters (within the UTF-8 encoding set). So \0 should also be considered a character, and only the delimiter should separate any two words from each other.
- File type: text file
I have tried the following code:
#include <iostream>
using std::cout;
using std::endl;
#include <fstream>
using std::ifstream;
#include <cstring>
const int MAX_TOKENS_PER_FILE = 100000;
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 256;
const char* const DELIMITER = " ";
int main()
{
int index = 0, keyword_num = 0;
// stores all the words that are in a file
unsigned char *keywords_extracted[MAX_TOKENS_PER_FILE];
// create a file-reading object
ifstream fin;
fin.open("data.txt"); // open a file
if (!fin.good())
return 1; // exit if file not found
// read each line of the file
while (!fin.eof())
{
// read an entire line into memory
char buf[MAX_CHARS_PER_LINE];
fin.getline(buf, MAX_CHARS_PER_LINE);
// parse the line into blank-delimited tokens
int n = 0; // a for-loop index
// array to store memory addresses of the tokens in buf
const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0
// parse the line
token[0] = strtok(buf, DELIMITER); // first token
if (token[0]) // zero if line is blank
{
keywords_extracted[keyword_num] = (unsigned char *)token[0];
keyword_num++;
for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
{
token[n] = strtok(0, DELIMITER); // subsequent tokens
if (!token[n]) break; // no more tokens
keywords_extracted[keyword_num] = (unsigned char *)token[n];
keyword_num++;
}
}
}
// process (print) the tokens
for(index=0;index<keyword_num;index++)
cout << keywords_extracted[index] << endl;
}
But I have a problem with the above code:
- The first word/entry in keywords_extracted is being replaced with '0' because the content of the last line the program reads is empty (correct me if I'm doing or assuming anything wrong).
Is there a way to overcome this problem in the above code, or are there any existing libraries for this functionality? Sorry for the lengthy explanation, just trying to be clear.