mycorpus.txt

Human where's machine interface for lab abc computer applications   
A where's survey of user opinion of computer system response time

stopwords.txt

let's
ain't
there's

The following code

corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist

gives the following output

set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])

Why is the apostrophe turning into \x92 in the 2nd set??

Pankaj Singhal
  • Never use Microsoft's editors if you want to write ASCII text. If you do use them, you have to handle cp1252 (which also includes that "right quotation mark"). – Bakuriu Mar 22 '13 at 06:42

1 Answer

Byte 92 (hex) in the windows-1252 encoding maps to Unicode code point 2019 (hex), which is 'RIGHT SINGLE QUOTATION MARK'. This looks very much like an apostrophe and is likely to be the actual character that you have in stopwords.txt. Judging from the way Python has interpreted it, the file has been encoded in windows-1252 or in another encoding that shares the ASCII range and this code point value.

' vs ’
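
To see the mapping concretely, here is a minimal sketch. It is written for Python 3 (the question uses Python 2), and the byte string and variable names are only illustrative stand-ins for one line of stopwords.txt.

```python
import unicodedata

# Bytes as they would appear in a windows-1252 encoded file (illustrative sample).
raw = b"ain\x92t"

# Decoding as cp1252 turns byte 0x92 into the Unicode character U+2019.
text = raw.decode("cp1252")
quote = text[3]

print(hex(ord(quote)))           # code point of the decoded character
print(unicodedata.name(quote))   # its official Unicode name

# Replacing the curly quote gives the plain ASCII apostrophe form.
print(text.replace("\u2019", "'"))
```

The first file in the question most likely contains byte 0x27 (the ASCII apostrophe) instead, which is why "where's" prints unchanged.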

CB Bailey
  • then in the 1st set why is it showing "where's" instead of 'where\x92s'?? – Pankaj Singhal Mar 22 '13 at 06:42
  • @PankajSinghal: Probably because you genuinely have the ASCII apostrophe character in the first file. To confirm this use a tool such as hexdump to verify the actual bytes in both of your files. – CB Bailey Mar 22 '13 at 06:44
  • ya, i see there's a difference in characters. So what should I do to make it read like "ain't" and not 'ain\x92t' ??? – Pankaj Singhal Mar 22 '13 at 06:48
  • @PankajSinghal: The simplest thing to do would be to edit `stopwords.txt` and replace all `’` with `'` using a text editor or `sed` or similar. – CB Bailey Mar 22 '13 at 06:56
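
If editing the file is not an option, the same replacement can be done while reading it. The following is a rough sketch rather than a definitive fix: `load_stoplist` is a hypothetical helper name, and it assumes the file really is windows-1252 encoded, which should be confirmed with a hex dump first.

```python
import io

def load_stoplist(path, encoding="cp1252"):
    """Read one stop word per line, mapping U+2019 to an ASCII apostrophe.

    Hypothetical helper; the default encoding is an assumption about the file.
    """
    stoplist = set()
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            # Normalize case and whitespace, then swap the curly quote.
            word = line.strip().lower().replace(u"\u2019", u"'")
            if word:  # skip blank lines
                stoplist.add(word)
    return stoplist
```

Calling `load_stoplist` on the stop word file should then yield entries such as "ain't" that compare equal to the words read from mycorpus.txt.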