I am trying to read data from a .txt file, find the 30 most common words, and print them out. However, whenever I read the file, I get this error:
"UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 338: ordinal not in range(128)".
Here is my code:
filename = 'wh_2015_national_security_strategy_obama.txt'
#extracts the year from the file name (the four characters after the 'wh_' prefix)
year = filename[3:7]
ecount = 30
#opens the file and reads it
file = open(filename,'r').read() #THIS IS WHERE THE ERROR IS
#counts the characters and the lines, strips the non-word characters, lowercases the text, and splits it into a list of words.
numchar = len(file)
numlines = file.count('\n')
file = file.replace(",","").replace("'s","").replace("-","").replace(")","")
words = file.lower().split()
dictionary = {}
#this is a set of all the words to not count for the most commonly used.
dontcount = {"the", "of", "in", "to", "a", "and", "that", "we", "our", "is", "for", "at", "on", "as", "by", "be", "are", "will","this", "with", "or",
"an", "-", "not", "than", "you", "your", "but","it","a","and", "i", "if","they","these","has","been","about","its","his","no",
"because","when","would","was", "have", "their","all","should","from","most", "were","such","he", "very","which","may","because","--------",
"had", "only", "no", "one", "--------", "any", "had", "other", "those", "us", "while",
"..........", "*", "$", "so", "now","what", "who", "my","can", "who","do","could", "over", "-",
"...............","................", "during","make","************",
"......................................................................", "get", "how", "after",
"..................................................", "...........................", "much", "some",
"through","though","therefore","since","many", "then", "there", "–", "both", "them", "well", "me", "even", "also", "however"}
for w in words:
    if w not in dontcount:
        if w in dictionary:
            dictionary[w] += 1
        else:
            dictionary[w] = 1
num_words = sum(dictionary[w] for w in dictionary)
#This sorts the dictionary and makes it so that the most popular is at the top.
x = [(dictionary[w],w) for w in dictionary]
x.sort()
x.reverse()
#This prints out the number of characters, lines, and words (not including stop words).
print(str(filename))
print('The file has ',numchar,' number of characters.')
print('The file has ',numlines,' number of lines.')
print('The file has ',num_words,' number of words.')
#This provides the structure for how the most common words should be printed out
i = 1
for count, word in x[:ecount]:
    print("{0}, {1}, {2}".format(i,count,word))
    i += 1
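
A note on the error itself: byte 0x92 is outside the ASCII range, so the 'ascii' codec cannot decode it. In Windows-1252 that byte is the right single quotation mark (’), which often appears in text copied from word processors. The sketch below is only an illustration under two assumptions that the post does not confirm: that the code runs on Python 3, and that the file is Windows-1252 encoded. The helper name `read_text` and the sample text are hypothetical.

```python
import os
import tempfile


def read_text(path, encoding="cp1252"):
    # encoding="cp1252" is an assumption; pass the file's real encoding if it is known.
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Build a throwaway file that contains the problematic byte 0x92.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"America\x92s security\n")
    sample_path = tmp.name

text = read_text(sample_path)
os.remove(sample_path)

# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(text))
```

If the file turns out to use a different encoding (UTF-8, for example), the same call with that encoding name would be the thing to try. Passing `errors="replace"` to `open` is another option when the encoding is unknown, at the cost of losing the undecodable characters.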