
I have a website based on GAE and Python, and I'd like the user to be able to upload a text file for processing. My implementation is based on standard code from the docs (see http://code.google.com/appengine/docs/python/blobstore/overview.html) and my text file upload handler essentially looks like this:

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        blob_reader = blobstore.BlobReader(blob_info.key())
        for line in blob_reader:
            line = line.rstrip().decode('cp1252')
            do_something(line)
        blob_reader.close()

This works fine for a text file encoded with Code Page 1252, which is what you get when using Windows Notepad and saving with what it calls an "ANSI" encoding. But if you use this handler with a file that has been saved with Notepad's UTF-8 encoding, and contains, say, some Cyrillic characters or a u-umlaut, you'll end up with gibberish. For such a file, changing decode('cp1252') to decode('utf_8') will do the trick. (Well, there's also the possibility of a byte order mark (BOM) at the beginning, but that's easily stripped away.)
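To make the mismatch concrete, here is a small sketch (the byte strings are made-up examples, not taken from any real upload) showing how the same bytes come out correctly or as gibberish depending on the codec passed to decode():

    data = 'f\xc3\xbcr'              # UTF-8 bytes for u'f\xfcr' ("fuer" with u-umlaut)
    print(data.decode('utf_8'))      # u'f\xfcr'      -- correct
    print(data.decode('cp1252'))     # u'f\xc3\xbcr'  -- mojibake
    # Notepad's UTF-8 files may also begin with the 3-byte BOM '\xef\xbb\xbf',
    # which is easy to strip before decoding:
    raw = '\xef\xbb\xbftext'
    if raw.startswith('\xef\xbb\xbf'):
        raw = raw[3:]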

But how do you know which decoding to use? The BOM isn't guaranteed to be there, and I don't see any other way to know, other than to ask the user—who probably doesn't know either. Is there a reliable method for determining the encoding? I don't necessarily have to use the blobstore if some other means solves it.

And then there's the encoding that Windows Notepad calls "Unicode", which is UTF-16 little-endian. I could find no codec (including "utf_16_le") that correctly decodes a file saved with this encoding. Can such a file be read?

Dragonfly

2 Answers


Maybe this will help: Python: Is there a way to determine the encoding of text file?.
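For reference, a minimal sketch of how the chardet library from that question is typically used (here raw stands for the byte string read from the uploaded file, which is an assumption for the example):

    import chardet

    # detect() returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99}
    result = chardet.detect(raw)
    encoding = result['encoding']   # may be None if nothing plausible was found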

demalexx
  • I don't know how I missed this during my pre-post search, but in any case I'm grateful to know about chardet. I've taken what I've learned and posted my own answer to my question, which should be helpful to others facing the same problem. – Dragonfly Jan 23 '12 at 09:20

Following the response from demalexx, my upload handler now determines the encoding using chardet (http://pypi.python.org/pypi/chardet) which, from what I can tell, works extremely well. Along the way I've discovered that using "for line in blob_reader" to read uploaded text files is extremely troublesome. Instead, if you don't mind reading your entire file in one gulp the solution is easy. (Note the stripping away of one BOM sequence, and the splitting of lines across CR/LF.)

import chardet
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the whole blob into memory, then let chardet guess the encoding.
        text = blobstore.BlobReader(blob_info.key()).read()
        encoding = chardet.detect(text)['encoding']
        if encoding is not None:
            # Strip a leading BOM (if any) and split on CR/LF line endings.
            for line in text.decode(encoding).lstrip(u'\ufeff').split(u'\x0d\x0a'):
                do_something(line)

If you want to read piecemeal from your uploaded file, you're in for a world of pain. The problem is that "for line in blob_reader" apparently reads up to where a line-feed (\x0a) byte is found, which is disastrous when reading a utf_16_le encoded file as it chops a \x0a\x00 sequence in half!
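To see why, a small illustration (the sample string is made up): the UTF-16-LE newline is the two bytes \x0a\x00, so anything that splits on the single byte \x0a cuts it in half:

    data = u'ab\r\ncd'.encode('utf_16_le')
    # data == 'a\x00b\x00\r\x00\n\x00c\x00d\x00'
    chunks = data.split('\x0a')
    # chunks == ['a\x00b\x00\r\x00', '\x00c\x00d\x00']
    # The stray '\x00' now sits at the front of the next "line".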

I don't recommend it, but here's an upload handler that will successfully process files saved in any of the encodings Windows 7 Notepad offers (namely, ANSI, UTF-8, Unicode, and Unicode big endian) a line at a time. As you can see, stripping away the line termination sequences is cumbersome.

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        blob_reader = blobstore.BlobReader(blob_info.key())
        # Let chardet guess the encoding from an initial sample, then rewind.
        encoding = chardet.detect(blob_reader.read(10000))['encoding']
        if encoding is not None:
            blob_reader.seek(0)
            for line in blob_reader:
                # Skip a BOM at the start of the line (only the first line should have one).
                if line[:2] in ['\xff\xfe', '\xfe\xff']:
                    start = 2
                elif line[:3] == '\xef\xbb\xbf':
                    start = 3
                else:
                    start = 0
                # Trim the encoding-specific CR/LF byte sequence from the end.
                if encoding == 'UTF-16BE':
                    if line[-4:] == '\x00\x0d\x00\x0a':
                        line = line[start:-4]
                    elif start > 0:
                        line = line[start:]
                elif encoding == 'UTF-16LE':
                    # The previous "line" was cut mid-character, leaving a stray
                    # null byte at the start of this one.
                    if line[start] == '\x00':
                        start += 1
                    if line[-3:] == '\x0d\x00\x0a':
                        line = line[start:-3]
                    elif start > 0:
                        line = line[start:]
                elif line[-2:] == '\x0d\x0a':
                    line = line[start:-2]
                elif start > 0:
                    line = line[start:]
                do_something(line.decode(encoding))

This is undoubtedly brittle, and my tests have been restricted to those four encodings, and only for how Windows 7 Notepad creates files. Note that before reading a line at a time I'm grabbing up to 10000 bytes for chardet to analyze. That's only a guess as to how much it might need. This clumsy double-read is another reason to avoid this solution.
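If the fixed 10000-byte guess bothers you, chardet also ships an incremental detector that can be fed chunks until it is confident; a sketch of that approach (the chunk size and variable names here are my assumptions, not part of the handler above):

    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    while not detector.done:
        chunk = blob_reader.read(1024)
        if not chunk:
            break
        detector.feed(chunk)
    detector.close()
    encoding = detector.result['encoding']   # may still be None
    blob_reader.seek(0)                      # rewind before the line-by-line pass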

Dragonfly
  • If you don't mind reading the whole blob into memory, why use blobstore upload in the first place? Why not just upload directly to your handler? – Nick Johnson Jan 23 '12 at 23:00
  • Also, please do file a bug against BlobReader. It's not currently unicode-aware, and it should be - if you specify the correct encoding when opening it, it ought to be able to find newlines correctly. – Nick Johnson Jan 23 '12 at 23:03
  • I would be happy to use a simpler system -- I just don't know how! I followed what I found in the docs and figured I was doing it right. Can you point me to a doc that describes the way you're suggesting, or post some sample code? Much appreciated! (I'll file that bug.) – Dragonfly Jan 24 '12 at 03:38
  • Just upload using a regular multipart form, and handle the data as you would in any other webapp (a sketch of this appears after these comments). – Nick Johnson Jan 24 '12 at 04:59
  • My "easy" solution has a major flaw: it works with small files only. For large files you run the risk of exceeding the 128MB soft memory limit for default instances. Even backend instances are limited to 1GB. Uploads of around 3GB would not be unusual for me, with no fixed upper limit, so reading the entire file into memory is not a solution. It seems I must return to my clumsier implementation, at least until BlobReader is encoding aware. But this is a poor solution, as don't know the full range of possible encodings, and I must be prepared for any. Anyone able to shed more light on this? – Dragonfly Feb 24 '12 at 18:47
  • If you've got no metadata that tells you what encoding the data is in, that's a problem regardless of how you get the data. Python has a module to guess encodings (I can't recall its name), but that's far from ideal. – Nick Johnson Feb 26 '12 at 03:20
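For completeness, a minimal sketch of the "regular multipart form" route suggested above (the handler class and field names are my assumptions, and ordinary App Engine requests are capped at roughly 32 MB, so this would not help with the multi-gigabyte uploads mentioned in the later comments):

    import webapp2
    import chardet

    class SimpleUpload(webapp2.RequestHandler):
        def post(self):
            # For a plain <input type="file" name="file"> field, request.get()
            # returns the uploaded bytes directly -- no Blobstore involved.
            data = self.request.get('file')
            encoding = chardet.detect(data)['encoding']
            if encoding is not None:
                for line in data.decode(encoding).lstrip(u'\ufeff').split(u'\x0d\x0a'):
                    do_something(line)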