repeated words from text file in c++

Question

This program finds a given word and erases every occurrence of it from a text file. I wrote the code below, which works by asking the user for a specific word, but I want the program to find the repeated words by itself, so that the result shows only the unrepeated words. I have tried my best but have failed. Could you suggest a better way?

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
   ifstream fin("name.txt");
   ofstream fout("result.txt");
   string word;
   string line;

   int pos;
   int ccount=0;

   cout<<"Please input word:";
   cin>>word;

   while(getline(fin,line))
   {
       pos= 0;
       while(line.find(word,pos)!=string::npos)
       {
               pos = line.find(word,pos);
               line.erase(pos,word.length());
       }

        fout<<line<<endl;
   }
    fin.close();
    fout.close();
}

Can you load the words into a `std::vector` and nuke them with `std::unique`? Depending on your definition of repeated, mind you. Also watch out for [Why is iostream::eof inside a loop condition considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – user4581301, Jun 07 '18 at 02:28
There are multiple fundamental bugs in the shown code, the least of which is the fact that [eof in a loop condition is always a bug](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong), not to mention that if the word being searched is "pop", the shown code will conclude that a single line containing just the word "hippopotamus" contains that word. Before you can even begin the relatively mid-complexity task of removing individual words from a line, you need to get the task of reading the file correctly, and detecting eof correctly, done. – Sam Varshavchik, Jun 07 '18 at 02:28
@user4581301 std::unique doesn't guarantee that a single instance of the word will remain (unless sort is used beforehand) as it only removes duplicates that are contiguous. I would recommend using a combination of std::vector (to keep the order) and std::set (to guarantee uniqueness). – A. Laribi, Jun 07 '18 at 02:47
@A.Laribi Yupppers. Depends on your definition of repeated. `std::unique` will pick off "The The" but not "The Bart. The". `std::set` is likely the right way to go. – user4581301, Jun 07 '18 at 02:50
@A. Laribi, could you show me how it is done? I am a little bit confused – user8729892, Jun 07 '18 at 02:52
What exactly is a *word* and a *repeated word*? For example in *Bob eats an apple. Jane eats an apple too.*, *eats* is trivially a repeated word. But what about *apple*? Should punctuation be processed specially? And should `appletree` count as a repetition of *apple* (which your current code actually says...)? And are *apple*, *Apple* and *APPLE* the same word? Word processing can soon become more complex... – Serge Ballesta, Jun 07 '18 at 08:10
@SergeBallesta E.g. in "Bob eats apple. Jane eats an apple too", the second 'eats' and 'apple' should be erased and the first ones should be output. apple, APPLE and Apple are the same – user8729892, Jun 07 '18 at 09:44
Then you really should use the `` module to identify words and convert them to a single case. If you simply add each identified word to a set, the set magic will keep one single version of each word (but will lose the insertion order...) – Serge Ballesta, Jun 07 '18 at 11:18
@Serge Ballesta, how can I apply it to this? It seems much more advanced to me – user8729892, Jun 07 '18 at 12:37
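
To illustrate the `std::unique` point made in the comments above, here is a minimal sketch. The function names `drop_adjacent` and `sort_then_unique` are made up for this example and appear nowhere in the thread. It shows that `std::unique` alone removes only neighbouring duplicates, while sorting first removes every repeat at the cost of the original word order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: removes only *adjacent* duplicates,
// which is all that std::unique does on its own.
std::vector<std::string> drop_adjacent(std::vector<std::string> words) {
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}

// Hypothetical helper: sorting first makes all equal words adjacent,
// so every repeat is removed, but the original order is lost.
std::vector<std::string> sort_then_unique(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

With the input "The Bart The", `drop_adjacent` leaves all three words in place, whereas `sort_then_unique` returns "Bart The". That is why the comments suggest a set for detecting repeats and a vector for keeping the order.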

A. Laribi · Answer 1 · 2018-06-07T07:20:18.133

1

You can use a std::vector to keep track of the words as you read them and a std::unordered_set to make sure each word is added to the std::vector only once. The reason you want the std::vector is that the set won't keep the std::strings in insertion order. This should do the job:

#include <algorithm>
#include <cctype>
#include <fstream>
#include <unordered_set>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;               // first occurrences, in input order
    std::unordered_set<std::string> uniqueWords;  // lowercase forms already seen

    std::ifstream fin("in.txt");
    if (fin.is_open())
    {
        std::string line;
        while (std::getline(fin, line))
        {
            std::istringstream iss(line);
            for (std::string word; iss >> word;)
            {
                std::string lowercaseWord = word;
                // Take unsigned char: passing a negative char to std::tolower is undefined behavior
                std::transform(word.begin(), word.end(), lowercaseWord.begin(),
                               [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
                if (uniqueWords.find(lowercaseWord) == uniqueWords.end())
                {
                    uniqueWords.insert(lowercaseWord);
                    words.push_back(word);
                }
            }
        }

        fin.close();
    }

    std::ofstream fout("out.txt");
    if (fout.is_open())
    {
        for (const std::string& word : words)   // standard range-based for ("for each ... in" is an MSVC extension)
        {
            fout << word << " ";
        }

        fout.close();
    }

    return 0;
}

edited Jun 07 '18 at 07:20

answered Jun 07 '18 at 03:15

A. Laribi


std::unordered_set would be more efficient than std::set. – Gaurav Singh Jun 07 '18 at 03:45
@A. Laribi, I checked it out, but the result is still the same as before – user8729892 Jun 07 '18 at 04:34
@A. Laribi, I mean it should erase all repeated words except for the first one found. – user8729892 Jun 07 '18 at 04:49
@user8729892 what's the content of your input file? – A. Laribi Jun 07 '18 at 06:55
@A. Laribi, it is a small part of the story (Romeo and Juliet). – user8729892 Jun 07 '18 at 07:06
@user8729892 I'm not sure I understand the issue you're having. Is the code in my answer giving you an output text with repeated words? – A. Laribi Jun 07 '18 at 07:08
@A. Laribi, the output text has repeated words – user8729892 Jun 07 '18 at 07:17
@user8729892 it could be due to a casing issue. I updated the answer to be case insensitive. – A. Laribi Jun 07 '18 at 07:18
@A. Laribi, how do you think, if a similar word is found, should it be deleted? – user8729892 Jun 07 '18 at 08:41
@user8729892 are you sure you're looking at the correct output file (out.txt)? If you are then I don't know what's wrong here, you'll have to debug the code yourself. – A. Laribi Jun 07 '18 at 09:59
It is surely that one, okay – user8729892 Jun 07 '18 at 11:00

score 0 · Answer 2 · answered Jun 07 '18 at 13:45

OK, it can be done with only reasonably complex tools: only the locale is needed to convert all words to lower case and to handle punctuation.

Specifications:

a word is a sequence of non-space characters that neither starts nor ends with a punctuation character
words are separated from each other by at least one space character
in order to identify unique words, they are first converted to lower case
unique words are stored in a set: the insertion order is lost and only alphabetical order remains
space and punctuation characters are those of the default locale

Here is one possible implementation:

#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <locale>

int main() {
    std::ifstream in("name.txt");      // io streams
    std::ofstream out("result.txt");
    std::set<std::string> uniq_words;  // a set to store unique words

    // ctyp facet of default locale to identify punctuations
    std::locale loc;
    const std::ctype<char>& ctyp = std::use_facet<std::ctype<char> >(loc);

    std::string word;
    while (in >> word) {          // split text in words
        // ignore any punctuation at the beginning or the end of the word
        auto beg = word.begin();
        while ((beg != word.end()) && ctyp.is(ctyp.punct, *beg)) beg++;
        auto end = word.end();
        while ((end > beg) && ctyp.is(std::ctype<char>::punct, end[-1])) end--;

        for (auto it=beg; it != end; it++) {     // convert what is kept to lower case
            *it = ctyp.tolower(*it);
        }
        if (beg != end) {                               // skip tokens that were only punctuation
            uniq_words.insert(std::string(beg, end));   // insert the lower case word in the set
        }
    }
    // Ok, we have all unique words: write them to output file
    for (auto w: uniq_words) {
        out << w << " ";
    }
    out << std::endl;
    return 0;
}
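
Both answers write every surviving word on a single line. If the line breaks of the input should be kept, as the `getline` loop in the question does, a rough sketch could look like the following. `erase_repeats_keep_lines` is a hypothetical name, and for brevity the key is only lowercased, so punctuation is not stripped here.

```cpp
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: copies `in` to `out` line by line, dropping every word
// whose lowercase form was already seen anywhere earlier in the input.
// Runs of whitespace inside a line collapse to a single space.
void erase_repeats_keep_lines(std::istream& in, std::ostream& out) {
    std::unordered_set<std::string> seen;
    for (std::string line; std::getline(in, line);) {
        std::istringstream words(line);
        bool first_on_line = true;
        for (std::string word; words >> word;) {
            std::string key = word;
            for (char& c : key)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if (!seen.insert(key).second) continue;   // repeated word: skip it
            if (!first_on_line) out << ' ';
            out << word;
            first_on_line = false;
        }
        out << '\n';                                   // keep the line structure
    }
}
```

Because the function takes generic streams, it can be called with the `std::ifstream` and `std::ofstream` objects from the question, or with string streams when experimenting.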
