
I'm processing HTML files in a local directory that originated from a website, working in Notepad++ on Windows 10. The files claim to be 'utf-8' but are heavy with script code. When I write the extracted text to a file, I get \u#### escapes, \x## escapes, and garbage characters instead of the full human-readable text. Mostly it's the \u2019 codes that aren't being converted, but a handful of other characters are being mangled too.

with open(self.srcFilename, 'r', encoding='utf8') as f:
    self.rawContent = f.read()  # the with block closes the file; no f.close() needed
soup = BeautifulSoup(self.rawContent, 'lxml')
# ::: other tag processing code :::
storytags = []
for section in soup.find('article'):
    nextNode = section
    if soup.find('article').find('p'):
        # ::: code to walk through tags :::
        if tag_name == "p":
            storytags.append(nextNode.text)
        # ::: conditions to end loop :::
i = 1
for line in storytags:
    print("[line %d] %s" % (i, line))
    logger.write("[line %d] %s\n" % (i, line))
    i += 1
setattr(self, 'chapterContent', storytags)

Without the utf-8 encoding argument on the read, I get this error:

File "C:\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 52120: character maps to <undefined>
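That traceback actually supports the UTF-8 theory: 0x9d is one of the few bytes Windows-1252 leaves unmapped, and it shows up here because it is the trailing byte of the UTF-8 sequence for characters like the right double quotation mark (U+201D, encoded as E2 80 9D). A quick way to confirm, using a sample byte sequence of my own rather than the actual file:

# b'\xe2\x80\x9d' is U+201D (a curly double quote) encoded as UTF-8;
# cp1252 has no mapping for the trailing 0x9d byte
b'\xe2\x80\x9d'.decode('utf-8')    # -> '\u201d' (works)
b'\xe2\x80\x9d'.decode('cp1252')   # raises UnicodeDecodeError on byte 0x9d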

So the file read needs utf-8 encoding. If I print to the console, the text from the section above comes out legibly. However, writing to a file gives me garbage characters, like They’ve instead of They've, and “Let’s instead of "Let's.
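Incidentally, that garbage is the classic signature of UTF-8 bytes being decoded as cp1252, Windows' default code page. A minimal sketch reproducing it, assuming the log file is written as UTF-8 but later viewed as cp1252:

s = "They\u2019ve"                           # U+2019 RIGHT SINGLE QUOTATION MARK
garbled = s.encode('utf-8').decode('cp1252')
print(garbled)                               # -> They’ve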

After a lot of reading, the closest I've come to human-readable output is changing my write() statement, but I'm still left with stray escape codes.

(1) logger.write("[line %d] %s\n" % (i, line.encode('unicode_escape').decode()))
(2) logger.write("[line %d] %s\n" % (i, line.encode().decode('utf-8')))

The first statement gives me text, but also \u#### codes and a few \xa0 codes. The second statement generates an HTML file with text I can read in a browser, but \u2019 still isn't interpreted correctly by the Calibre epub builder. I tried the solution from this question, but it doesn't recognize the \u code.
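For what it's worth, the difference between the two statements can be seen with a short sketch (the sample string is mine, chosen to match the characters reported above):

line = "They\u2019ve"

# (1) unicode_escape rewrites non-ASCII characters as literal backslash
#     escapes, which is where the \u#### and \xa0 sequences come from
line.encode('unicode_escape').decode()   # -> 'They\\u2019ve'

# (2) encode() followed by decode('utf-8') is a round trip that returns
#     the string unchanged, so the curly quote is still present
line.encode().decode('utf-8')            # -> 'They’ve'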

Is there a possible fix, or are there some pointers for how to get a better handle on my problem?

EDIT: Forgot to add, I'm writing with open('log.txt', 'w+'):. I was previously using encoding='utf-8', but that seemed to make it worse.
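For reference, when open() gets no encoding argument, Python uses the platform default, which is cp1252 on Windows. The conventional approach is to pass the encoding explicitly; whether the output then looks right depends on what the program reading log.txt assumes. A sketch, reusing the storytags list from above:

# Write the log as UTF-8 explicitly; a reader that also treats the file
# as UTF-8 (e.g. Notepad++ with its encoding set to UTF-8) will then
# display the curly quotes correctly
with open('log.txt', 'w+', encoding='utf-8') as logger:
    for i, line in enumerate(storytags, 1):
        logger.write("[line %d] %s\n" % (i, line))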

Meghan M.
  • Not sure how anyone can help you given the above description. If you're working with files that claim to have 'utf-8' encoding but might not, then you need to figure out how you want to handle that (correct the source files, handle invalid encoding in some way, etc.). But without access to the files it would be difficult for anyone to recommend a solution. – user2263572 Oct 22 '18 at 00:45
  • Was hoping to get some suggestions for encode/decode with `line` to help with debugging. Something like `for c in line: print("%s, ord(%d)" % (c, ord(c)))` with some more likely encode/decode variations (a sketch along those lines follows just below). I'm at best an infrequent programmer and newish to Python. – Meghan M. Oct 22 '18 at 02:14
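Picking up that comment's request, here is a minimal debugging sketch (my own addition, not from the original post) that dumps each non-ASCII character in a line with its code point and Unicode name:

import unicodedata

# Show every non-ASCII character with its code point and official name,
# so stray characters like U+2019 or U+00A0 are easy to identify
for c in line:
    if ord(c) > 127:
        print("%r U+%04X %s" % (c, ord(c), unicodedata.name(c, '<unnamed>')))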

1 Answer


After a week of searching around, I finally found the answer after posting here: Removing unicode \u2026 like characters in a string in python2.7. By the way, I'm working in Python 3.6, so it's not related to the Python version.

with open(output, 'w+') as out:
    # ::: code :::
    line = line.encode('utf-8').decode('ascii', 'ignore')
    out.write(line)

I still need to work through variations of open(output, 'w+') with and without an encoding argument. Anyway, this finally gave me the best results.
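One caveat with decode('ascii', 'ignore'): it silently drops every non-ASCII character, so They've loses its apostrophe entirely rather than gaining a straight one. If keeping the text readable matters, an alternative is to translate the common "smart" punctuation first; the mapping below is my own example, not part of the linked answer:

# Map common "smart" punctuation to ASCII equivalents before falling
# back to ascii/ignore, so quotes survive instead of vanishing
replacements = {
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2026': '...',                # horizontal ellipsis
    '\u00a0': ' ',                  # non-breaking space
}
for src, dst in replacements.items():
    line = line.replace(src, dst)
line = line.encode('ascii', 'ignore').decode()  # drop anything still non-ASCII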

Meghan M.
  • Another useful reference for debugging encoding issues: https://stackoverflow.com/questions/13106175/how-to-find-out-number-name-of-unicode-character-in-python – Meghan M. Oct 22 '18 at 20:11