0

This program finds a word and erases every occurrence of it in a text file. The code below does this for one specific word that I type in, but I want the program to find the repeated words by itself, so that the result shows only the unrepeated words. I have tried my best but have failed. Could you suggest a better way?

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream fin("name.txt");
    ofstream fout("result.txt");
    string word;
    string line;

    cout << "Please input word:";
    cin >> word;

    while (getline(fin, line))
    {
        // Erase every occurrence of `word` (matched as a substring) from this line.
        string::size_type pos = 0;
        while ((pos = line.find(word, pos)) != string::npos)
        {
            line.erase(pos, word.length());
        }

        fout << line << endl;
    }
    fin.close();
    fout.close();
}
  • 1
    Can you load the words into a `std::vector` and nuke them with `std::unique`? Depending on your definition of repeated, mind you. Also watch out for [Why is iostream::eof inside a loop condition considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – user4581301 Jun 07 '18 at 02:28
  • 1
    There are multiple fundamental bugs in the shown code, the least of which is the fact that [eof in a loop condition is always a bug](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong), not to mention that if the word being searched is "pop", the shown code will conclude that a single line containing just the word "hippopotamus" contains that word. Before you can even begin the relatively mid-complexity task of removing individual words from a line, you need to get the task of reading the file correctly, and detecting eof correctly, done. – Sam Varshavchik Jun 07 '18 at 02:28
  • 2
    @user4581301 std::unique doesn't guarantee that a single instance of the word will remain (unless sort is used before hand) as it only removes duplicates that are contiguous. I would recommend using a combination of std::vector (to keep the order) and std::set (to guarantee uniqueness). – A. Laribi Jun 07 '18 at 02:47
  • @A.Laribi Yupppers. Depends on your definition of repeated. `std::unique` will pick off "The The" but not "The Bart. The". `std::set` is likely the right way to go. – user4581301 Jun 07 '18 at 02:50
  • @A. Laribi, could you show me how it is done? I am a little bit confused. – user8729892 Jun 07 '18 at 02:52
  • What is exactly a *word* and a *repeated word*. For example in *Bob eats an apple. Jane eats an apple too.*, *eats* is trivially a repeated word. But what about *apple*? Shall punctuations be specially processed? And should `appletree` be a repetition from *apple* (which your current code actually says...)? And are *apple*, *Apple* and *APPLE* the same word? Word processing can soon become more complex... – Serge Ballesta Jun 07 '18 at 08:10
  • @SergeBallesta, e.g. in *Bob eats apple. Jane eats an apple too*, the second 'eats' and 'apple' should be erased and the first ones should be output. apple, APPLE and Apple are the same. – user8729892 Jun 07 '18 at 09:44
  • Then you really should use the `<locale>` module to identify words and convert them to a single case. If you simply add each identified word to a set, the set magic will keep one single version of each word (but will lose the insertion order...) – Serge Ballesta Jun 07 '18 at 11:18
  • @Serge Ballesta, how can I apply it to this? It seems much more advanced to me. – user8729892 Jun 07 '18 at 12:37
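
To make the `std::unique` point from the comments concrete, here is a minimal sketch. The helper name `drop_adjacent_duplicates` is made up for illustration; it only shows that `std::unique` collapses neighbouring duplicates and leaves non-adjacent ones alone.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: removes only *adjacent* duplicates,
// which is all that std::unique promises.
std::vector<std::string> drop_adjacent_duplicates(std::vector<std::string> words)
{
    // std::unique moves the kept elements to the front and returns the new
    // logical end; erase() then trims the leftover tail.
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

With the input `The The Bart.` the second `The` is dropped, but with `The Bart. The` all three words survive, which is why a set (or sorting first) is needed when repeated words are not next to each other.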

2 Answers

1

You can use a std::vector to keep track of the words as you read them, and a set (std::unordered_set in the code below) to make sure each word is added to the std::vector only once. The reason you want the std::vector is that a set won't keep the std::strings in their order of insertion. Words are compared case-insensitively, but punctuation is not stripped. This should do the job:

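#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

int main()
{
    std::vector<std::string> words;               // first occurrences, in input order
    std::unordered_set<std::string> uniqueWords;  // lowercase keys seen so far

    std::ifstream fin("in.txt");
    if (fin.is_open())
    {
        std::string line;
        while (std::getline(fin, line))
        {
            std::istringstream iss(line);
            for (std::string word; iss >> word;)
            {
                // Build a lowercase copy to use as a case-insensitive key.
                // The unsigned char parameter avoids undefined behaviour in
                // std::tolower for negative char values.
                std::string lowercaseWord = word;
                std::transform(word.begin(), word.end(), lowercaseWord.begin(),
                               [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
                // insert() reports whether the key was new, so no separate find() is needed.
                if (uniqueWords.insert(lowercaseWord).second)
                {
                    words.push_back(word);
                }
            }
        }

        fin.close();
    }

    std::ofstream fout("out.txt");
    if (fout.is_open())
    {
        // Standard range-based for (the non-standard "for each ... in" is MSVC-only).
        for (const std::string& word : words)
        {
            fout << word << " ";
        }

        fout.close();
    }

    return 0;
}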
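
As a quick check of what this approach does with the sentence from the comments, here is the same idea pulled into two small functions. The names `first_occurrences` and `join` are hypothetical helpers introduced only for this sketch; the logic is the vector-plus-set technique described above, applied to a string instead of a file.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keeps the first occurrence of each whitespace-separated
// token, comparing tokens case-insensitively. Punctuation is NOT stripped.
std::vector<std::string> first_occurrences(const std::string& text)
{
    std::vector<std::string> words;
    std::unordered_set<std::string> seen;
    std::istringstream iss(text);
    for (std::string word; iss >> word;)
    {
        std::string key = word;
        std::transform(key.begin(), key.end(), key.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (seen.insert(key).second)  // true only the first time a key is seen
            words.push_back(word);
    }
    return words;
}

// Hypothetical helper: joins words with single spaces, for easy comparison.
std::string join(const std::vector<std::string>& words)
{
    std::string out;
    for (const std::string& w : words)
    {
        if (!out.empty()) out += ' ';
        out += w;
    }
    return out;
}
```

Calling `join(first_occurrences("Bob eats an apple. Jane eats an apple too."))` gives `Bob eats an apple. Jane apple too.`: the second `eats` and `an` are dropped, but the second `apple` stays because the first token was `apple.` with a trailing full stop. Stripping punctuation before building the key would be needed to treat them as the same word.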
A. Laribi
0

OK, it can be done with only reasonably complex tools: only the locale is required to convert all words to lower case and to handle punctuation.

Specifications:

  • a word is a sequence of non-space characters that neither starts nor ends with a punctuation character
  • words are separated from each other by at least one space character
  • in order to identify unique words, they are first converted to lower case
  • unique words are stored in a set: the insertion order is lost and only alphabetical order remains
  • spaces and punctuation characters are those of the default locale

Here is one possible implementation:

#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <locale>

int main() {
    std::ifstream in("name.txt");      // io streams
    std::ofstream out("result.txt");
    std::set<std::string> uniq_words;  // a set to store unique words

    // ctyp facet of default locale to identify punctuations
    std::locale loc;
    const std::ctype<char>& ctyp = std::use_facet<std::ctype<char> >(loc);

    std::string word;
    while (in >> word) {          // split text in words
        // ignore any punctuation at the beginning or the end of the word
        auto beg = word.begin();
        while ((beg != word.end()) && ctyp.is(ctyp.punct, *beg)) beg++;
        auto end = word.end();
        while ((end > beg) && ctyp.is(std::ctype<char>::punct, end[-1])) end--;

        for (auto it=beg; it != end; it++) {     // convert what is kept to lower case
            *it = ctyp.tolower(*it);
        }
        if (beg != end) {   // skip tokens that consisted only of punctuation
            uniq_words.insert(std::string(beg, end));   // insert the lower case word in the set
        }
    }
    // Ok, we have all unique words: write them to output file
    for (auto w: uniq_words) {
        out << w << " ";
    }
    out << std::endl;
    return 0;
}
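
The code above writes the words in alphabetical order and in lower case. If the first occurrence of each word should be kept in its original position and spelling, as asked in the comments, the same locale-based trimming can be combined with a vector for the order and a set for the lookup. This is only a sketch under that assumption; the function name `unique_words_in_order` is made up for illustration.

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the first occurrence of each word, in input
// order, after trimming leading/trailing punctuation. Words are compared
// case-insensitively using the ctype facet of the global locale.
std::vector<std::string> unique_words_in_order(const std::string& text)
{
    const std::locale loc;  // copy of the global locale ("C" unless changed)
    const std::ctype<char>& ctyp = std::use_facet<std::ctype<char>>(loc);

    std::vector<std::string> result;       // keeps insertion order
    std::unordered_set<std::string> seen;  // lowercase keys already emitted
    std::istringstream iss(text);
    for (std::string token; iss >> token;)
    {
        // Trim punctuation at both ends of the token.
        std::size_t beg = 0, end = token.size();
        while (beg < end && ctyp.is(std::ctype_base::punct, token[beg])) ++beg;
        while (end > beg && ctyp.is(std::ctype_base::punct, token[end - 1])) --end;
        if (beg == end) continue;  // token was pure punctuation

        const std::string word = token.substr(beg, end - beg);
        std::string key = word;
        for (char& c : key) c = ctyp.tolower(c);
        if (seen.insert(key).second) result.push_back(word);
    }
    return result;
}
```

For the input `Bob eats apple .Jane eats an apple too` this yields `Bob`, `eats`, `apple`, `Jane`, `an`, `too`. Note that the trimmed punctuation is discarded, so the output is a list of words and not a cleaned-up copy of the original text.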
Serge Ballesta