
Howdy folks,

I'm new to getting data from the web using Python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html):

import urllib.request

# The URL of the page I want (the failing case):
webAddress = 'https://projects.fivethirtyeight.com/2018-nba-predictions/'

file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')

And I'd expect dataString to be a string of HTML (see below for my expectations in this specific case):

<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc

Instead, for the 538 website, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

My research has suggested that the problem is that my file isn't actually encoded using UTF-8, but both the page's declared charset and beautiful-soup's UnicodeDammit() claim it's UTF-8 (the second might just be echoing the first). chardet.detect() doesn't suggest any encoding at all. I've tried substituting each of the following for 'UTF-8' in the encoding parameter of decode(), to no avail (a sketch of these attempts follows the list):

  • ISO-8859-1
  • latin-1
  • Windows-1252
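
For reference, here's a sketch of those checks and decode attempts (assuming `data` holds the bytes read above, with the beautifulsoup4 and chardet packages installed):

from bs4 import UnicodeDammit
import chardet

print(UnicodeDammit(data).original_encoding)  # claims 'utf-8'
print(chardet.detect(data))                   # suggests no encoding

# latin-1 and ISO-8859-1 accept every byte, so they "succeed" but
# produce garbage; Windows-1252 may raise on its undefined bytes.
for enc in ('UTF-8', 'ISO-8859-1', 'latin-1', 'Windows-1252'):
    try:
        print(enc, repr(data.decode(encoding=enc)[:20]))
    except UnicodeDecodeError as exc:
        print(enc, 'failed:', exc)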

Perhaps worth mentioning is that the bytes object `data` doesn't look like I'd expect it to. Here's `data[:10]` from a working URL:

b'\n<!DOCTYPE'

Here's `data[:10]` from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's up?

  • Grabbing the data with `wget` provides a gzip-compressed file, which, uncompressed, provides a regular UTF-8 HTML page; probably the server is badly configured, and provides a compressed page without setting the relevant headers. – Matteo Italia Dec 10 '17 at 16:13
  • (Look at `file.headers['content-encoding']`) – Ry- Dec 10 '17 at 16:15
  • @Ryan: indeed it does seem to set `gzip` as `content-encoding`, but neither `curl` nor `wget` do anything about that, which is strange, as usually they handle transparently transport-level compression... there must be something strange in how this server behaves. – Matteo Italia Dec 10 '17 at 16:20
  • @matteoitalia using wget has indeed shown me that it's gzip-compressed. This is unfamiliar territory for me using Python, but it's enough progress that I'm confident exploring further. Thanks!! – Andy Pollino Dec 10 '17 at 16:21
  • @AndyPollino: looking further into this, it seems that `curl` (without `--compressed`), `wget` and `urllib` (in general) don't handle gzipped content automatically, thus they don't set the corresponding accept-encoding request header, but the server provides gzipped content anyhow. Seems like you'll have to handle it by yourself. OTOH, the great `requests` library does handle the whole thing by itself. – Matteo Italia Dec 10 '17 at 16:26

2 Answers


The server provided you with gzip-compressed data. This is not all that common: urllib doesn't send any Accept-Encoding header by default, so servers generally play it safe and don't compress the data.

Still, the content-encoding field of the response is set, so you can tell that your page is indeed gzip-compressed, and you can decompress it with Python's gzip module before further processing.

import urllib.request
import gzip

file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
# The response headers tell us whether the body was compressed;
# use .get() so a missing Content-Encoding header doesn't blow up:
if file.headers.get('content-encoding', '').lower() == 'gzip':
    data = gzip.decompress(data)
dataString = data.decode(encoding='UTF-8')

OTOH, if you have the possibility of using the requests module, it will handle all of this mess by itself, including compression (did I mention that you may also get deflate besides gzip, which is the same thing but with different headers?) and, at least partially, the encoding.

import requests

webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)  # requests decompresses gzip/deflate transparently
print(repr(r.text))           # r.text is already a decoded str

This will perform your request and correctly print out the already-decoded Unicode string.
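
If you ever need to handle the deflate case by hand as well, here's a minimal sketch (the decode_body helper is hypothetical, not part of urllib; it assumes the same response object and raw bytes as in the first snippet):

import gzip
import zlib

def decode_body(response, data):
    # Hypothetical helper: pick a decompressor based on Content-Encoding.
    encoding = response.headers.get('content-encoding', '').lower()
    if encoding == 'gzip':
        return gzip.decompress(data)
    if encoding == 'deflate':
        # Servers disagree on whether "deflate" means a zlib-wrapped
        # stream or a raw DEFLATE stream; try zlib first, then raw.
        try:
            return zlib.decompress(data)
        except zlib.error:
            return zlib.decompress(data, -zlib.MAX_WBITS)
    return data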

Matteo Italia

You are reading gzipped data (see http://www.forensicswiki.org/wiki/Gzip). You have to decompress it.
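
For example, a minimal sketch, reusing the `data` bytes from the question:

import gzip

html = gzip.decompress(data).decode('utf-8')  # bytes -> decoded HTML string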

Ned Batchelder