
I have been searching the internet but could not find an existing tool for extracting words from a file using a specific delimiter in C++. Does anyone know of an existing C++ library or piece of code that does this? This is what I want to achieve:

  • Objective: extract words from a file using a delimiter
  • Word type: a word can be any combination of unsigned characters (the text is UTF-8 encoded), so \0 should also be treated as an ordinary character. Only the delimiter should separate one word from the next.
  • File type: text file

I have tried the following code:

#include <iostream>
using std::cout;
using std::endl;

#include <fstream>
using std::ifstream;

#include <cstring>

const int MAX_TOKENS_PER_FILE = 100000;
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 256;
const char* const DELIMITER = " ";

int main()
{
  int index = 0, keyword_num = 0;

  // stores all the words that are in a file
  unsigned char *keywords_extracted[MAX_TOKENS_PER_FILE];    

  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found

  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      keywords_extracted[keyword_num] = (unsigned char *)token[0];
      keyword_num++;

      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
        keywords_extracted[keyword_num] = (unsigned char *)token[n];
        keyword_num++;
      }
    }

  }
  // process (print) the tokens
  for (index = 0; index < keyword_num; index++)
    cout << keywords_extracted[index] << endl;
}

But I have a problem with the above code:

  • The first word/entry in keywords_extracted is being replaced with '0', apparently because the last line the program reads is empty. (Correct me if I'm doing or assuming anything wrong.)

Is there a way to overcome this problem in the above code, or are there existing libraries that provide this functionality? Sorry for the lengthy explanation; I'm just trying to be clear.

timrau
annunarcist

1 Answer


std::getline takes a delimiter as its third argument, which can be something other than the default '\n'. Does that not work for you?

Example:

std::string word;
while (std::getline(fin, word, '|')) {
   std::cout << word;
}

This should read and print every word, using the pipe character (|) as the separator.
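
To make the idea concrete, here is a minimal sketch that collects the words instead of printing them. The helper names split_stream and split_file are hypothetical (they are not from the question or from any library), and the sketch assumes a single-byte delimiter. Each word is stored as its own std::string copy. That matters because the code in the question keeps pointers into buf, which is declared inside the loop and reused for every line, so earlier entries are likely overwritten by later lines; an empty last line would then blank out the first entry. std::string also holds embedded \0 bytes without truncation, which strtok and C-style strings cannot do.

```cpp
#include <fstream>
#include <istream>
#include <sstream>   // only needed when trying this out on in-memory input
#include <string>
#include <vector>

// Hypothetical helper: reads everything available from `in` and splits it
// on a single-byte delimiter. Every byte other than the delimiter,
// including '\n' and '\0', is kept as part of the current word.
std::vector<std::string> split_stream(std::istream& in, char delimiter)
{
    std::vector<std::string> words;
    std::string word;
    // std::getline returns the stream, which converts to false once
    // nothing more could be extracted, so no separate eof() check is needed.
    while (std::getline(in, word, delimiter)) {
        words.push_back(word); // owned copy, independent of any buffer
    }
    return words;
}

// Hypothetical helper: opens `path` in binary mode (assumed here so that no
// newline translation is applied) and splits its contents.
// Returns an empty vector if the file cannot be opened.
std::vector<std::string> split_file(const std::string& path, char delimiter)
{
    std::ifstream fin(path, std::ios::binary);
    if (!fin) {
        return {};
    }
    return split_stream(fin, delimiter);
}
```

For example, split_file("data.txt", '|') would return the words of the question's data.txt split on the pipe character. Two behaviours to be aware of: consecutive delimiters produce an empty word between them, and a delimiter at the very end of the input does not produce a trailing empty word. A delimiter longer than one byte (for instance a multi-byte UTF-8 character) is outside what std::getline supports and would need a different approach.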

Soren