
I have the following Python 3 code for a Flask app. In the app (a website), you upload a TXT or TSV file containing author information. The file is read into memory (it's small, and the app will be deployed to a read-only file system), then the app formats it in a particular way and displays the results.

The issue I'm having is that when people upload the file with special characters in it (e.g. accents in authors' names), I get the error:

  File "/Users/cdastmalchi/Desktop/author_script/main.py", line 81, in process_file
    contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 201: invalid start byte

Example line with special characters:

Department of Pathology, Lariboisière Hospital, APHP and Paris Diderot University, Sorbonne Paris

Flask code:

@app.route('/process_file', methods=['POST'])
def process_file():
    # Run checks on the file
    if 'file' not in flask.request.files or not flask.request.files['file'].filename:
        return flask.jsonify({'result':'False', 'message':'no files selected'})
    file = flask.request.files['file']
    filename = secure_filename(file.filename)
    if not allowed_file(file.filename):
        return flask.jsonify({'result':'False', 'message':'Must be TXT file!'})

    # Stream file and check that places exist
    contents = csv.DictReader(file.read().decode('utf-8').splitlines(), delimiter='\t')
    check_places, json_data = places_exist(contents)

    if check_places is False:
        return flask.jsonify({'result':'False', 'message':'There is an affiliation missing from your Place list. Please re-try.'})

    flask.session['filename'] = json_data
    return flask.jsonify({'result':'True'})

Update:

When I run `uchardet file.tsv` (where file.tsv is the test file with the special characters), the output is ISO-8859-9.
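For what it's worth, decoding bytes with that encoding does work in isolation. The snippet below simulates the situation (the byte string is a stand-in for the real file contents, not my actual data):

```python
# Simulate an upload that is ISO-8859-9 encoded rather than UTF-8.
raw = 'Lariboisière Hospital'.encode('iso-8859-9')

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # Fails the same way as the traceback above (invalid start/continuation byte).
    print(exc)

# Decoding with the encoding uchardet reported succeeds.
print(raw.decode('iso-8859-9'))
```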

Update 2:

Here's my attempt at using csv.Sniffer() on a test file with special characters. But I'm not quite sure how to translate this code to work with a file in memory.

import csv

sniff_range = 4096
delimiters = ';\t,'

infile_name = 'unicode.txt'

sniffer = csv.Sniffer()

with open(infile_name, 'r') as infile:
    # Determine dialect
    dialect = sniffer.sniff(
        infile.read(sniff_range), delimiters=delimiters
    )
    infile.seek(0)

    # Sniff for header
    has_header = sniffer.has_header(infile.read(sniff_range))
    infile.seek(0)

    reader = csv.reader(infile, dialect)

    for line in reader:
        print(line)

output:

['Department of Pathology', 'Lariboisière Hospital', 'APHP and Paris Diderot University', 'Sorbonne Paris']

Question: How can I modify my csv.DictReader code to handle these special characters (keeping in mind I can only read the file into memory)?
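The closest I've come is the sketch below: decode the raw bytes first, then wrap the resulting text in io.StringIO so the sniffer and the reader get a file-like object backed by memory. (dict_reader_from_bytes is a hypothetical helper I wrote for illustration, not part of my app.)

```python
import csv
import io

def dict_reader_from_bytes(data, encoding, delimiters=';\t,', sniff_range=4096):
    """Build a csv.DictReader from in-memory bytes, sniffing the dialect."""
    text = data.decode(encoding)
    dialect = csv.Sniffer().sniff(text[:sniff_range], delimiters=delimiters)
    # StringIO gives csv a file-like object without touching the file system.
    return csv.DictReader(io.StringIO(text), dialect=dialect)

# Example: a two-line TSV, encoded the way uchardet reported.
data = 'name\taffiliation\nJane Doe\tLariboisière Hospital\n'.encode('iso-8859-9')
for row in dict_reader_from_bytes(data, 'iso-8859-9'):
    print(row)
```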

Update 3:

My question is different from the alleged dupe because I'm trying to figure out the encoding of a file stored in memory, which makes things trickier. I'm trying to implement the following method in my process_file Flask route to determine the encoding, where file in this case is a Flask file storage object (file = flask.request.files['file']). But when I try to print the lines within contents, I get nothing.

file = flask.request.files['file']
result = chardet.detect(file.read())
charenc = result['encoding']

contents = csv.DictReader(file.read().decode(charenc).splitlines(), delimiter='\t')
claudiadast
  • The problem you have is that your code assumes that the file is UTF-8 encoded, but in fact the users of the application may upload a file with any encoding. So using chardet to get the likely encoding before processing is probably the best that you can do. – snakecharmerb Aug 15 '19 at 18:46
  • Possible duplicate of [How to determine the encoding of text?](https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text) – snakecharmerb Aug 15 '19 at 18:47
  • @snakecharmerb: Thanks. That still doesn't help solve the latter part of this issue though, which is HOW to code the part that decodes the file based on its uchardet result. I'm not sure how to proceed with that part since I have to work with the file from memory. – claudiadast Aug 15 '19 at 18:52
  • What you are doing seems almost fine - before reading the file into the dictreader, use chardet (or some other package) to guess the encoding and replace the hardcoded 'utf-8' with the guessed encoding (like in this answer to the dupe https://stackoverflow.com/a/45167602/5320906). – snakecharmerb Aug 16 '19 at 06:48
  • I would use `uchardet {file}` beforehand on the file, but because it is not saved locally, I can't run a package on it. – claudiadast Aug 20 '19 at 17:16
  • There is a 3rd party python package [chardet](https://pypi.org/project/chardet/) that you can install with `pip`, that guesses encodings. – snakecharmerb Aug 20 '19 at 17:37
  • Please see my update 3 above with using chardet. – claudiadast Aug 20 '19 at 18:39
  • If it is supported, try calling `file.seek(0)` after reading it, to reset the file pointer; if that doesn't work, consider assigning the output of `file.read()` to a variable and passing the variable to `chardet` and `csv.DictReader`. – snakecharmerb Aug 20 '19 at 19:10
  • I tried both of those approaches and am unfortunately still getting the UnicodeDecodeError. It's weird because `chardet` identifies the file's encoding just fine, but when I use that to decode the file upon reading it through `csv.DictReader()`, it doesn't appear to work. – claudiadast Aug 20 '19 at 21:27

1 Answer


This version of your code successfully decodes and prints the file for me.

  @app.route('/process_file', methods=['POST'])
  def process_file():
      # Run checks on the file
      file = flask.request.files['file']
      result = chardet.detect(file.read())
      charenc = result['encoding']

      file.seek(0)
      # Stream file and check that places exist
      reader = csv.DictReader(file.read().decode(charenc).splitlines(), delimiter='\t')
      for row in reader:
          print(row)

      return flask.jsonify({'result': charenc})
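Note that chardet.detect can return None for the encoding when it has no guess, and the guess can still be wrong for some files. A defensive variant (just a sketch; safe_decode is an illustrative helper, not required if detection always succeeds) falls back to a lossy decode instead of raising:

```python
def safe_decode(data, guess):
    """Decode with the detected encoding, falling back to lossy UTF-8."""
    try:
        return data.decode(guess or 'utf-8')
    except (UnicodeDecodeError, LookupError):
        # Replace undecodable bytes with U+FFFD rather than failing the upload.
        return data.decode('utf-8', errors='replace')

# usage: raw = file.read()
#        text = safe_decode(raw, chardet.detect(raw)['encoding'])
```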
snakecharmerb
  • When I try this on my particular file, I get the error: `'charmap' codec can't decode byte 0x8f in position 4660: character maps to ` (but I think that's just an issue with my file?). Is there a way to get around this? – claudiadast Aug 22 '19 at 16:37