String matching from a Unicode text file in Python?

Question

import re, codecs
import string
import sys
stopwords=codecs.open('stopwords_harkat1.txt','r','utf_8')
lines=codecs.open('Corpus_v2.txt','r','utf_8')
for line in lines:
    line = line.rstrip().lstrip()
    #print line
    tokens = line.split('\t')
    token=tokens[4]

    if token in stopwords:
        print token

This code runs without errors, but it does not match strings between the two files. Can anyone help me, please?

I also tried the `match` method, but that doesn't work either.

falsetru · Answer 1 · 2014-07-29T14:57:30.483

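You need to load the contents of the file, not just open it. `token in stopwords` on an open file object compares the token against whole lines, which still carry their trailing newline, and the first failed search reads the file to the end, so nothing ever matches.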
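
To see the effect in isolation, here is a small sketch that uses an in-memory stream in place of the stop word file. The `io.StringIO` stand-in, the sample words, and the variable names are assumptions made for illustration; they are not taken from the question.

```python
import io

# In-memory stand-in for the stop word file (sample words are made up).
stream = io.StringIO(u"the\nand\nof\n")

# `in` on a file-like object iterates over its lines, which keep their "\n",
# so no line is exactly equal to "and".
first_check = u"and" in stream

# The failed search above read the stream to the end, so nothing is left
# to compare against, even with the newline included.
second_check = u"and\n" in stream

# Loading the stripped lines into a set gives an exact, reusable lookup.
stream.seek(0)
stopwords = set(line.strip() for line in stream)
third_check = u"and" in stopwords

print(first_check)
print(second_check)
print(third_check)
```

The first two checks fail and only the set lookup succeeds, which matches the behaviour described in the question.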

Replace the following line:

stopwords = codecs.open('stopwords_harkat1.txt','r','utf_8')

with:

with codecs.open('stopwords_harkat1.txt','r','utf_8') as f:
    # Assuming one stop word per line.
    stopwords = set(line.strip() for line in f)

    # Otherwise (several words per line), use the following line instead:
    # stopwords = set(word for line in f for word in line.split())
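
Putting the pieces together, the whole task might look roughly like the sketch below. It assumes one stop word per line and that the token is the fifth tab-separated field, as in the question. The helper names `load_stopwords` and `matching_tokens` and the sample data are hypothetical, and in-memory streams stand in for the two files so the example runs on its own.

```python
import io


def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines (one word per line)."""
    return set(line.strip() for line in lines if line.strip())


def matching_tokens(lines, stopwords, column=4):
    """Yield the tab-separated field at `column` of each line if it is a stop word."""
    for line in lines:
        # Strip only the line ending so empty leading fields do not shift columns.
        fields = line.rstrip(u"\r\n").split(u"\t")
        # Skip lines with too few fields instead of raising IndexError.
        if len(fields) > column and fields[column] in stopwords:
            yield fields[column]


# Made-up sample data standing in for the stop word file and the corpus.
stopword_file = io.StringIO(u"the\n\u0645\u0646\n")
corpus_file = io.StringIO(
    u"1\ta\tb\tc\tthe\n"
    u"2\ta\tb\tc\tcat\n"
    u"3\ta\tb\tc\t\u0645\u0646\n"
    u"short line\n"
)

stopwords = load_stopwords(stopword_file)
found = list(matching_tokens(corpus_file, stopwords))
for token in found:
    print(token)
```

With the real files, the two `io.StringIO` objects could be replaced by `io.open('stopwords_harkat1.txt', encoding='utf-8')` and `io.open('Corpus_v2.txt', encoding='utf-8')`, ideally inside `with` blocks so the files are closed afterwards.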

edited Jul 29 '14 at 14:57

answered Jul 26 '14 at 04:39


I tried it, but I get this error: Traceback (most recent call last): File "C:\Users\Desktop\remove stop words\remove\remove.py", line 7, in with open(codecs.open('stopwords_harkat1.txt','r','utf_8'))as f: TypeError: coercing to Unicode: need string or buffer, instance found – msm Jul 29 '14 at 14:38
@msm, the leading `open(` was a typo. I have updated the answer; please check it out. – falsetru Jul 29 '14 at 14:57
