
I am trying to read data from a .txt file, find the 30 most common words, and print them out. However, whenever I read the file, I get this error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 338: ordinal not in range(128)".

Here is my code:

filename = 'wh_2015_national_security_strategy_obama.txt'
#extracts the year from the file name (characters 3 to 6 of 'wh_2015_...')
year = filename[3:7]
ecount = 30
#opens the file and reads it
file = open(filename,'r').read()   #THIS IS WHERE THE ERROR IS
#counts the characters, then counts the lines, removes the non-word characters, changes it all to lower case and splits it into a list of words.
numchar = len(file)
numlines = file.count('\n')
file = file.replace(",","").replace("'s","").replace("-","").replace(")","")
words = file.lower().split()
dictionary = {}
#this is a dictionary of all the words to not count for the most commonly used. 
dontcount = {"the", "of", "in", "to", "a", "and", "that", "we", "our", "is", "for", "at", "on", "as", "by", "be", "are", "will","this", "with", "or",
             "an", "-", "not", "than", "you", "your", "but","it","a","and", "i", "if","they","these","has","been","about","its","his","no",
             "because","when","would","was", "have", "their","all","should","from","most", "were","such","he", "very","which","may","because","--------",
             "had", "only", "no", "one", "--------", "any", "had", "other", "those", "us", "while",
             "..........", "*", "$", "so", "now","what", "who", "my","can", "who","do","could", "over", "-",
             "...............","................", "during","make","************",
             "......................................................................", "get", "how", "after",
             "..................................................", "...........................", "much", "some",
             "through","though","therefore","since","many", "then", "there", "–", "both", "them", "well", "me", "even", "also", "however"}
for w in words:
    if w not in dontcount:
        if w in dictionary:
            dictionary[w] +=1
        else:
            dictionary[w] = 1
num_words = sum(dictionary[w] for w in dictionary)
#This sorts the dictionary and makes it so that the most popular is at the top.
x = [(dictionary[w],w) for w in dictionary]
x.sort()
x.reverse()
#This prints out the number of characters, lines, and words (not including stop words).
print(str(filename))
print('The file has', numchar, 'characters.')
print('The file has', numlines, 'lines.')
print('The file has', num_words, 'words.')
#This provides the structure for how the most common words should be printed out
i = 1
for count, word in x[:ecount]:
    print("{0}, {1}, {2}".format(i,count,word))
    i+=1
Alastair McCormack
  • Possible duplicate http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte & http://stackoverflow.com/questions/26619801/unicodedecodeerror-ascii-codec-cant-decode-byte-0x92-in-position-47-ordinal – Jaimes May 07 '16 at 02:03
  • See the post I linked to and the [Python 3 docs for `open`](https://docs.python.org/3/library/functions.html#open), especially its `encoding` parameter. For Python 2, the "new" version of `open` is in [`io.open`](https://docs.python.org/2/library/io.html#io.open). PS: That byte is most likely a nonstandard (Microsoft) right-single-quote, frequently misused as a "curly" apostrophe. – Kevin J. Chase May 07 '16 at 02:15
  • **It's none of the above** - all those questions and answers deal with Python 2. Not one will help the OP fix the very simple question relating to Python 3's TextIOWrapper throwing an exception, which has to be corrected by selecting the right encoding – Alastair McCormack May 07 '16 at 11:34

2 Answers


In Python 3, when opening files in text mode (the default), Python uses your environment settings to choose an appropriate encoding.

If it can't resolve it (or your environment specifically defines ASCII), then it will use ASCII. This is what has happened in your case.
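To see which encoding Python would pick on your machine, something like the following sketch may help. It assumes that text-mode `open()` without an `encoding` argument falls back to whatever `locale.getpreferredencoding(False)` reports, which is what the `open` documentation describes; the variable name is only illustrative.

```python
import codecs
import locale

# The encoding that text-mode open() is documented to fall back to
# when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

# Normalise the name through the codec registry, e.g. 'ANSI_X3.4-1968' -> 'ascii'.
codec_name = codecs.lookup(default_encoding).name

print(default_encoding, codec_name)
```

If this prints an ASCII-family name, any byte above 0x7F in the file will fail exactly as in the question.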

If the ASCII decoder finds anything that's not ASCII, it raises an error. In your case, it failed on the byte 0x92. That byte is not valid ASCII, and on its own it is not valid UTF-8 either (in UTF-8, 0x92 can only appear as a continuation byte). It does make sense in the windows-1252 encoding, however, where it's a smart quote (U+2019 RIGHT SINGLE QUOTATION MARK). It could also make sense in other 8-bit code pages, but you'll have to know or work that out yourself.

To make your code read windows-1252 encoded files, you need to change your open() call to:

file = open(filename, 'r', encoding='windows-1252').read()
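As a rough, self-contained check of the above, the sketch below uses a made-up sample (`raw`, not taken from your file) containing the byte 0x92, shows that ASCII rejects it, and reads it back from a temporary file with an explicit encoding. The `with` block is just a tidier way to make sure the file gets closed.

```python
import os
import tempfile

# Hypothetical sample: "nation" + byte 0x92 + "s", as a windows-1252 file would store it.
raw = b"nation\x92s security\n"

# ASCII cannot decode 0x92; this reproduces the error from the question.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# windows-1252 maps 0x92 to U+2019 (RIGHT SINGLE QUOTATION MARK).
decoded = raw.decode("windows-1252")

# The same thing through open(): write the bytes to a temporary file,
# then read them back in text mode with an explicit encoding.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    with open(path, "r", encoding="windows-1252") as fh:
        text = fh.read()
finally:
    os.remove(path)

print(ascii_failed)
print(repr(text))
```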
Alastair McCormack

I am learning python, so please take this response with that in mind.

file = open(filename,'r').read() #THIS IS WHERE THE ERROR IS

From what I have learned so far, your read is combined with the open() object creation. The open() function creates the file handle, and the read() function reads the file into a string. Both functions would return, I presume, success/fail, or in the open() function's case, in part, the file object reference. I am not sure they can be combined successfully.

From what I have learned thus far, this is to be done in 2 steps, i.e.

file = open(filename, 'r')  # creates the object
myString = file.read()      # reads the entire object into a string

The open() function creates the file object, so it probably returns the object number, or success/fail.

The read, read(n), readline() or readlines() functions are used on the object.

.read() reads the entire file into a single string
.read(n) reads the next n characters (bytes in binary mode) into a string
.readline() reads the next line into a string
.readlines() reads the entire file into a list of strings

You can split them up and see if the same result happens ??? just a thought from a newbie :)

pyNewb
  • Assigning a file-like object to a local variable before reading it does not change the contents of that file, nor how they are converted from bytes to strings, which is what caused the [`UnicodeDecodeError`](https://docs.python.org/3/library/exceptions.html#UnicodeDecodeError). See the `encoding` and `errors` parameters for [`open`](https://docs.python.org/3/library/io.html#io.open), and also the various `read`-related methods of the [`TextIOBase`](https://docs.python.org/3/library/io.html#io.TextIOBase) ("text file") it returns. – Kevin J. Chase May 07 '16 at 09:41
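To illustrate the point of that comment, here is a small sketch. The sample bytes and the helper `two_step` are hypothetical, made up for this example. It splits `open()` and `read()` into two steps and shows that the ASCII decode still fails, while the `encoding` and `errors` parameters do change the outcome.

```python
import os
import tempfile

# Hypothetical sample containing one non-ASCII byte (0x92).
raw = b"nation\x92s\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

def two_step(encoding, errors="strict"):
    # Step 1: create the file object. Step 2: read from it.
    fh = open(path, "r", encoding=encoding, errors=errors)
    try:
        return fh.read()
    finally:
        fh.close()

# Splitting the call in two does not avoid the decode error.
try:
    two_step("ascii")
    split_helped = True
except UnicodeDecodeError:
    split_helped = False

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = two_step("ascii", errors="replace")

# Choosing the right encoding decodes the byte properly.
fixed = two_step("windows-1252")

os.remove(path)
print(split_helped, repr(replaced), repr(fixed))
```

Note that `errors="replace"` hides the problem rather than fixing it, so picking the correct `encoding` is usually the better choice.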