Extract words (made of unsigned characters) from a file with a delimiter in C++

Question

I have been searching the internet but could not find any existing tool for extracting words from a file using a specific delimiter in C++. Does anyone know of an existing library or code in C++ that does the job? Below is what I want to achieve:

Objective : to extract words from a file using a delimiter
Words type : words can be made of any combination of unsigned characters (within the UTF-8 encoding set), so \0 should also be treated as a character. Only the delimiter should separate any two words from each other.
File type : text file

I have tried the following code :

#include <iostream>
using std::cout;
using std::endl;

#include <fstream>
using std::ifstream;

#include <cstring>

const int MAX_TOKENS_PER_FILE = 100000;
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 256;
const char* const DELIMITER = " ";

int main()
{
  int index = 0, keyword_num = 0;

  // stores all the words that are in a file
  unsigned char *keywords_extracted[MAX_TOKENS_PER_FILE];    

  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found

  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      keywords_extracted[keyword_num] = (unsigned char *)token[0];
      keyword_num++;

      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
        keywords_extracted[keyword_num] = (unsigned char *)token[n];
        keyword_num++;
      }
    }

  }
  // process (print) the tokens
  for (index = 0; index < keyword_num; index++)
    cout << keywords_extracted[index] << endl;
}

But I have a problem with the above code:

The first word/entry in keywords_extracted is being replaced with '0' because the content of the last line the program reads is empty (correct me if I am doing or assuming anything wrong).

Is there a way to overcome this problem in the above code, or are there any existing libraries for this functionality? Sorry for the lengthy explanation; I am just trying to be clear.

Try `fin.open("data.txt", ios::binary)` first. Otherwise, file input may stop at an `EOF` character. — timrau, Nov 13 '13 at 17:08
Is there a FAQ on why `while( !fin.eof() )` is wrong? I see this mistake every day here. — Slava, Nov 13 '13 at 17:11
@Slava: Will this work: http://www.parashift.com/c++-faq/input-output.html — Thomas Matthews, Nov 13 '13 at 20:15
You should use `std::string` as it works well with `getline()`. — Thomas Matthews, Nov 13 '13 at 20:16
Your title says `unsigned char` but your code uses `char`. Which is it? — Thomas Matthews, Nov 13 '13 at 20:16
@ThomasMatthews thanks. I found it http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong as well — Slava, Nov 13 '13 at 20:49
Well, I want it for words made of unsigned characters, but since strtok() returns only a char pointer, I had to use a char pointer and explicitly cast it to an unsigned char pointer. — annunarcist, Nov 14 '13 at 05:53

Soren · Answer 1 · 2013-11-13T17:19:58.273

2

std::getline takes a delimiter (3rd argument) which can be different than the default '\n' -- does that not work for you?

Example:

std::string word;
while (std::getline(fin, word, '|')) {
   std::cout << word;
}

This should read and print every word, using the pipe (|) as the separator.

edited Nov 13 '13 at 17:19

answered Nov 13 '13 at 17:13
