
The problem I am experiencing is that my code never gets individual words/tokens to compare against the stop words I want to remove from the original text. Instead, each iteration gives me a whole sentence, so nothing ever matches a stop word. Please show me a way to get individual tokens, match them against the stop words, and remove the matches.

from nltk.corpus import stopwords
import string, os
def remove_stopwords(ifile):
    processed_word_list = []
    stopword = stopwords.words("urdu")
    text = open(ifile, 'r').readlines()
    for word in text:
        print(word)
        if word not in stopword:
            processed_word_list.append('*')
            print(processed_word_list)
            return processed_word_list

if __name__ == "__main__":
    print ("Input file path: ")
    ifile = input()
    remove_stopwords(ifile)
  • The reason you're not getting the words in the text is that you're using the `readlines()` function. This gives you an iterable of the lines/sentences in the file, so when you say `for word in text:` you are getting the lines one by one. – Jake Conkerton-Darby Aug 10 '17 at 15:19
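
To illustrate the comment above, here is a minimal sketch (the file name `sample.txt` is hypothetical) showing that `readlines()` yields whole lines, while `str.split()` breaks a line into individual words:

# Minimal sketch: readlines() returns whole lines, split() returns words.
# 'sample.txt' is a made-up file name used only for illustration.
with open('sample.txt', 'r') as f:
    lines = f.readlines()        # e.g. ['this is a sentence\n', 'another line\n']

for line in lines:
    print(line)                  # prints a whole line, not a single word
    for word in line.split():    # split() breaks the line into individual tokens
        print(word)              # prints one word at a time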

1 Answer


Try this instead:

from nltk.corpus import stopwords
import ast

def remove_stopwords(ifile):
    processed_word_list = []
    stopword = stopwords.words("urdu")
    # The input file holds a Python-style list literal such as
    # ['this', 'is', 'an', 'apple'], so literal_eval converts the file's
    # text back into an actual list of tokens.
    with open(ifile, 'r') as f:
        words = ast.literal_eval(f.read())
    for word in words:
        print(word)
        if word not in stopword:
            # replace non-stop-words with '*' (mirrors the question's logic)
            processed_word_list.append('*')
        else:
            # keep the token when it is a stop word
            processed_word_list.append(word)
    print(processed_word_list)
    return processed_word_list

if __name__ == "__main__":
    print("Input file path: ")
    ifile = input()
    remove_stopwords(ifile)
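
As a hedged usage sketch (the file name and its contents are made up for illustration), this is how the function above is expected to behave:

# Write a small tokenized file, then run the function defined above.
with open('tokens.txt', 'w') as f:
    f.write("['this', 'is', 'an', 'apple']")

result = remove_stopwords('tokens.txt')
# Assuming none of these English tokens appears in the Urdu stop-word list,
# every word hits the `word not in stopword` branch, so the call prints each
# word, then ['*', '*', '*', '*'], and returns that list.
print(result)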
    This won't work as `line` is a string, thus you will iterate over the characters in `line`. Swap `line` for `line.split()` though and we're good to go. – Jake Conkerton-Darby Aug 10 '17 at 15:17
  • This code only gives me the first word and then terminates. I am not getting the whole list, just the first word in the file. I want it to iterate over all the words in the given text file, match them against the stop words, and then show me the list with the stop words removed. – user3778289 Aug 10 '17 at 16:21
  • Also, the `.split()` function makes tokens, while the file I am providing is already tokenized. – user3778289 Aug 10 '17 at 16:28
  • The reason it exited after the first word was that the `return` statement has to be outside the `for` loop. I edited the above code. It works for me now. – M3RS Aug 10 '17 at 16:43
  • Do you mean your input file already has one word per line? In that case the above can be simplified, but it should work nonetheless. – M3RS Aug 10 '17 at 16:47
  • My input file is already tokenized, for example `['this','is','an','apple']`; I already have my file like this. If I use the split function again it will tokenize the tokens a second time. – user3778289 Aug 10 '17 at 17:03
  • Does the above not work? How are the rows set up? Would help if you copied in the first few rows of the input file. – M3RS Aug 10 '17 at 17:22
  • Applied to the first row of my input file, `word_tokenize(line)` will tokenize my input file further, and it will not remove the stop words. – user3778289 Aug 10 '17 at 18:13
  • Yes, I tried this. The line `processed_word_list.append('*')` is not needed, so what I did is `if word not in stopword: processed_word_list.append(word)`, and it is still just tokenizing the document further. – user3778289 Aug 10 '17 at 18:17
  • I have changed the code above: removed `word_tokenize` and stripped off the unwanted characters. – M3RS Aug 10 '17 at 18:26
  • I want to ask one thing: why are we splitting the already tokenized words and then stripping each one, as in `[x.lstrip("['").rstrip("',]") for x in line.split()]`? Is there no other way to read each token and then remove the stop words? – user3778289 Aug 10 '17 at 18:56
  • This can be done with `ast.literal_eval()`. I updated the code. Also see here: https://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list-in-python (a short sketch of this is shown after the thread). – M3RS Aug 10 '17 at 20:43
  • Thank you for your response, but this is giving me an error: "SyntaxError: unexpected EOF while parsing". – user3778289 Aug 11 '17 at 10:08
  • Also, my input is like this: `['This', 'is', 'a','toy'.'He','is','a','boy']`, so I just want to read each value from the string representation of the list, say 'This', check it against the stop words file, then move on to the next value, and so on. – user3778289 Aug 11 '17 at 10:10
  • My input is like this: `['this' ,'is', ]`, something like that, with one `[` at the beginning and one `]` at the end, but now the output is not correct. – user3778289 Aug 11 '17 at 10:16
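
Picking up the `ast.literal_eval()` suggestion from the thread, here is a minimal sketch of what it does with a string that looks like the inputs described above (the tokens are made up):

import ast

# The raw contents of the file: a string that merely looks like a Python list.
raw = "['this', 'is', 'an', 'apple']"

tokens = ast.literal_eval(raw)   # now a real list: ['this', 'is', 'an', 'apple']
for token in tokens:
    print(token)                 # each token can be checked against the stop words

# Caveat: literal_eval needs a complete, well-formed literal. An empty or
# truncated file raises "SyntaxError: unexpected EOF while parsing", and stray
# characters (such as the '.' between 'toy' and 'He' quoted above) also raise a
# SyntaxError, which may explain the errors reported in the comments.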