I have a website based on GAE and Python, and I'd like the user to be able to upload a text file for processing. My implementation is based on standard code from the docs (see http://code.google.com/appengine/docs/python/blobstore/overview.html) and my text file upload handler essentially looks like this:
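
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class Uploader(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            upload_files = self.get_uploads('file')
            blob_info = upload_files[0]
            blob_reader = blobstore.BlobReader(blob_info.key())
            # Each line is read as raw bytes and decoded on its own.
            for line in blob_reader:
                line = line.rstrip().decode('cp1252')
                do_something(line)
            blob_reader.close()
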
This works fine for a text file encoded with Code Page 1252, which is what you get when using Windows Notepad and saving with what it calls an "ANSI" encoding. But if you use this handler with a file that has been saved with Notepad's UTF-8 encoding, and contains, say, some Cyrillic characters or a u-umlaut, you'll end up with gibberish. For such a file, changing decode('cp1252') to decode('utf_8') will do the trick. (Well, there's also the possibility of a byte order mark (BOM) at the beginning, but that's easily stripped away.)
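
As a small illustration of the BOM point, here is a sketch that uses Python's standard `utf-8-sig` codec, which decodes UTF-8 and silently drops a leading BOM if one is present. The helper name `decode_utf8_text` and the sample strings are made up for this example and are not part of my handler.

```python
import codecs

def decode_utf8_text(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # 'utf-8-sig' behaves like 'utf-8' but skips an initial EF BB BF sequence.
    return raw_bytes.decode('utf-8-sig')

# Hypothetical sample data: the same text saved with and without a BOM.
with_bom = codecs.BOM_UTF8 + u'Gr\u00fc\u00dfe'.encode('utf-8')
without_bom = u'Gr\u00fc\u00dfe'.encode('utf-8')

same_result = decode_utf8_text(with_bom) == decode_utf8_text(without_bom)
```

Both inputs decode to the same string, so the caller does not need to know whether Notepad wrote a BOM.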
But how do you know which decoding to use? The BOM isn't guaranteed to be there, and I don't see any other way to know, other than to ask the user, who probably doesn't know either. Is there a reliable method for determining the encoding? I don't necessarily have to use the blobstore if some other means solves it.
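
To make the question concrete, this is the kind of heuristic I can imagine, sketched under the assumption that only the three Notepad encodings need to be told apart. It checks for a BOM, then tries strict UTF-8, then falls back to Code Page 1252. The function name `guess_encoding` is hypothetical, and this is a guess rather than a reliable detector: a cp1252 file whose bytes happen to form valid UTF-8 would be misidentified, and UTF-32 is ignored.

```python
import codecs

def guess_encoding(raw_bytes):
    """Return a codec name that is a plausible guess for raw_bytes."""
    # A BOM, when present, is the strongest hint available.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM to pick the byte order.
        return 'utf-16'
    # Non-ASCII cp1252 text is rarely valid UTF-8, so a strict
    # UTF-8 decode that succeeds is a reasonable (not certain) signal.
    try:
        raw_bytes.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        # Assumed fallback: the Windows "ANSI" code page.
        return 'cp1252'
```

For example, `guess_encoding(u'\u00fc'.encode('cp1252'))` falls through to `'cp1252'` because the single byte `0xFC` is not valid UTF-8, while pure ASCII input is reported as `'utf-8'`, which decodes identically either way.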
And then there's the encoding that Windows Notepad calls "Unicode", which is UTF-16 little-endian with a BOM. I could find no codec (including "utf_16_le") that correctly decodes a file saved with this encoding when used in the handler above. Can such a file be read?
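
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the handler splits the raw bytes on `\n` before decoding. In UTF-16 LE a newline is the byte pair `0A 00`, so each line after the first starts with a stray `00` byte and every 16-bit unit is misaligned. The sketch below decodes the whole payload first and splits into lines afterwards. The helper `decode_lines` and the sample bytes are invented for illustration; with the blobstore, the bytes would presumably come from `blob_reader.read()`.

```python
import codecs

def decode_lines(raw_bytes, encoding):
    """Decode the whole payload first, then split it into lines."""
    text = raw_bytes.decode(encoding)
    # splitlines() handles \r\n and \n; rstrip() drops trailing spaces.
    return [line.rstrip() for line in text.splitlines()]

# Hypothetical sample: two lines as Notepad's "Unicode" option would save them.
sample = codecs.BOM_UTF16_LE + (
    u'\u041f\u0440\u0438\u0432\u0435\u0442\r\nf\u00fcr\r\n'.encode('utf-16-le'))

# Splitting the bytes on b'\n' leaves a leading NUL byte on the next chunk,
# which is why decoding line by line goes wrong.
second_chunk = sample.split(b'\n')[1]
misaligned = second_chunk.startswith(b'\x00')

# The 'utf-16' codec consumes the BOM and picks the byte order from it.
lines = decode_lines(sample, 'utf-16')
```

The trade-off is that the whole file is held in memory at once, which should be acceptable for small text uploads but may not suit large ones.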