I'm processing HTML files in a local directory that originated from a website, doing my development in Notepad++ on Windows 10. These files claim to be 'utf-8' but contain a lot of script code. When writing to a file, I get \u#### codes, \x## codes, and garbage characters, but not fully human-readable text. Mostly it's the \u2019 codes that aren't being converted, but a handful of others are being left unconverted too.
with open(self.srcFilename, 'r', encoding='utf8') as f:
    self.rawContent = f.read()
    f.close()
soup = BeautifulSoup(self.rawContent, 'lxml')
:::: <<<=== other tag processing code
for section in soup.find('article'):
    nextNode = section
    if soup.find('article').find('p'):
        ::: <<<=== code to walk through tags
        if tag_name == "p":
            storytags.append(nextNode.text)
        ::: <<<=== conditions to end loop
i = 1
for line in storytags:
    print("[line %d] %s" % (i, line))
    logger.write("[line %d] %s\n" % (i, line))
    i += 1
setattr(self, 'chapterContent', storytags)
Without the utf-8 encoding, I get the error:
File "C:\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 52120: character maps to <undefined>
So the file read is using utf-8 encoding. If I do a console print from the above section, it prints legibly. However, writing to a file gives me garbage characters, like They’ve instead of They've, and “Let’s instead of "Let's.
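For reference, here is a minimal sketch of what that garbage pattern looks consistent with: UTF-8 bytes being interpreted as cp1252 somewhere between the write and whatever displays the file. This is an assumption about the cause, not something I've confirmed, and the variable names are only for illustration. It also happens to reproduce the 0x9d error, since a right curly quote (U+201D) ends in that byte when encoded as UTF-8.

```python
# Hypothetical sample strings; the curly quotes are written as escapes
# so the source file's own encoding does not matter.
original = "They\u2019ve"                # They've with a curly apostrophe (U+2019)
utf8_bytes = original.encode("utf-8")    # the apostrophe becomes bytes E2 80 99

# Decoding those UTF-8 bytes as cp1252 yields three characters per curly quote.
garbled = utf8_bytes.decode("cp1252")
quote_garbled = "\u201cLet\u2019s".encode("utf-8").decode("cp1252")

# U+201D encodes to E2 80 9D, and byte 0x9D has no mapping in Python's cp1252 table.
try:
    "\u201d".encode("utf-8").decode("cp1252")
    raised = False
except UnicodeDecodeError:
    raised = True
```

If that is what is happening, the text itself is intact and only the decoding step on the reading side is wrong.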
After a lot of reading, the closest I've come to getting human-readable output is to change my write() statement but I'm still left with stray codes.
(1) logger.write("[line %d] %s\n" % (i, line.encode('unicode_escape').decode()))
(2) logger.write("[line %d] %s\n" % (i, line.encode().decode('utf-8')))
The first statement gives me text, but also \u#### codes and a few \xa0 codes too. The second statement generates an HTML file with text I can read in an HTML browser, but \u2019 still doesn't get interpreted by the Calibre epub builder correctly. I tried using this question/solution but it doesn't recognize the \u code.
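To make the two statements concrete, here is a small sketch of what each one does to a sample string (the sample text is made up for illustration). As far as I can tell, statement (1) deliberately turns non-ASCII characters into literal backslash escapes, and statement (2) is a round trip that returns the string unchanged.

```python
# Hypothetical sample: a curly apostrophe (U+2019) and a non-breaking space (U+00A0).
line = "They\u2019ve\xa0gone"

# Statement (1): unicode_escape replaces non-ASCII characters with literal
# backslash sequences, so the output contains the six characters \u2019.
escaped = line.encode("unicode_escape").decode()

# Statement (2): encode() defaults to UTF-8, so decoding as UTF-8 simply
# gives back the original string.
round_trip = line.encode().decode("utf-8")
```

So neither statement changes how the file itself is encoded on disk; that is decided by the open() call.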
Is there a possible fix, or are there any pointers for getting a better handle on what my problem might be?
EDIT: Forgot to add, I'm writing to the log file using with open('log.txt', 'w+'): . I was previously using encoding='utf-8' but that seemed to make it worse.
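For comparison, this is a minimal sketch of the write with an explicit encoding, followed by a read-back using the same encoding to check what actually landed in the file. The temporary path and sample lines are placeholders, not my real data. My understanding is that without an encoding argument, open() on Windows falls back to the locale's code page (cp1252 here, judging by the traceback), so the viewer and the writer have to agree on one encoding.

```python
import os
import tempfile

# Hypothetical sample lines containing curly quotes.
storytags = ["They\u2019ve", "\u201cLet\u2019s"]

# Placeholder location so the sketch does not overwrite a real log.txt.
path = os.path.join(tempfile.mkdtemp(), "log.txt")

# Write with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as logger:
    for i, line in enumerate(storytags, start=1):
        logger.write("[line %d] %s\n" % (i, line))

# Read back with the same encoding to verify the round trip.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read().splitlines()
```

If the read-back matches but the file still looks garbled in an editor or in Calibre, that would suggest the viewer is assuming a different encoding; passing encoding="utf-8-sig" instead writes a byte order mark, which some Windows tools use to detect UTF-8.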