
I'm trying to create a spam filter, and I need to train the model first. I read the words from a text file in which each paragraph has the word "spam" or "ham" as its first word, followed by the words in the mail, each immediately followed by its number of occurrences. The file contains multiple paragraphs. My program is able to read the first paragraph, that is, its words and their numbers of occurrences.

The problem is that the program stops reading after encountering the newline and doesn't read the next paragraph. I have a feeling that the way I am checking for the newline character that ends a paragraph is not entirely correct.

I have included two paragraphs below to give an idea of the training text file.

/000/003 ham need 1 fw 1 35 2 39 1 thanks 1 thread 2 40 1 copy 1 else 1 correlator 1 under 1 companies 1 25 1 he 2 26 2 168 1 29 2 content 4 1 1 6 1 5 1 4 1 review 2 we 1 john 3 17 1 use 1 15 1 20 1 classes 1 may 1 a 1 back 1 l 1 01 1 produced 1 i 1 yes 1 10 2 713 2 v6 1 p 1 original 2

/000/031 ham don 1 kim 5 dave 1 39 1 customer 1 38 2 thanks 1 over 1 thread 2 year 1 correlator 1 under 1 williams 1 mon 2 number 2 kitchen 1 168 1 29 1 content 4 3 2 2 6 system 2 1 2 7 1 6 1 5 2 4 1 9 1 each 1 8 1 view 2

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    int V = 0; // Total number of words

    ifstream fin;
    fin.open("train", ios::in);
    string word;
    int wordnum;
    int N[2] = {0};
    char c, skip;
    for (int i = 0; i < 8; i++) fin >> skip; // There are 8 characters before the first word of the paragraph
    while (!fin.fail())
    {
        fin >> word;
        if (word == "spam") N[0]++;
        else if (word == "ham") N[1]++;
        else
        {
            V++;
            fin >> wordnum;
        }
        int p = fin.tellg();
        fin >> c; // To check for a newline. If it's there, we skip the first eight characters of the new paragraph because those characters aren't supposed to be read
        if (c == '\n')
        {
            for (int i = 0; i < 8; i++) fin >> skip;
        }
        else fin.seekg(p);
    }

    cout << "\nSpam: " << N[0];
    cout << "\nHam :" << N[1];
    cout << "\nVocab: " << V;

    fin.close();

    return 0;
}
Cherry Cool

1 Answer


std::ifstream::operator>>() never stores \n in the variable: formatted extraction skips leading whitespace, newlines included, so the check c == '\n' is never true. If you need to handle whitespace and \n characters yourself, you can use std::ifstream::get().
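
As an alternative to checking individual characters with get(), one way to sidestep the newline test entirely is to read the file one line at a time with std::getline and parse each line through a std::istringstream. The following is only a minimal sketch of that approach, not the asker's code: the names TrainCounts and count_training_data are hypothetical, and it assumes every paragraph sits on a single line of the form "<id> <label> <word> <count> ...", as in the two sample paragraphs.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch (not part of the original post).
struct TrainCounts {
    int spam = 0;   // paragraphs labelled "spam"
    int ham = 0;    // paragraphs labelled "ham"
    int vocab = 0;  // number of (word, count) pairs read, like V in the question
};

// Assumes each paragraph is one line: "<id> <label> <word> <count> ...".
TrainCounts count_training_data(std::istream& in) {
    TrainCounts counts;
    std::string line;
    // std::getline consumes the trailing '\n' itself, so no manual check is needed.
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string id, label;
        if (!(fields >> id >> label)) continue;  // blank or incomplete line
        if (label == "spam") ++counts.spam;
        else if (label == "ham") ++counts.ham;
        else continue;                           // unrecognised label: skip the line
        std::string word;
        int occurrences;
        // Words are read as strings, so numeric tokens such as "35" work too.
        while (fields >> word >> occurrences) ++counts.vocab;
    }
    return counts;
}
```

To use it with the file from the question, open the stream as before (std::ifstream fin("train");) and call count_training_data(fin); the id token replaces the loop that skipped eight characters.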

Keith Pinson
Kastaneda