Howdy folks,
I'm new to getting data from the web using Python. I'd like to get the source code of this page as a string: https://projects.fivethirtyeight.com/2018-nba-predictions/
The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html):
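import urllib.request

# A URL for which this code works
webAddress = 'https://www.basketball-reference.com/boxscores/201712090ATL.html'
file = urllib.request.urlopen(webAddress)
data = file.read()  # raw response body as bytes
file.close()
dataString = data.decode(encoding='UTF-8')  # this is the line that fails for the 538 page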
I'd expect dataString to be a string of HTML. For the 538 page, I'd expect something like this:
<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc
Instead, for the 538 website, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My research suggests that the file isn't actually encoded as UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding. I've tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:
ISO-8859-1
latin-1
Windows-1252
Perhaps worth mentioning: the bytes object data doesn't look like I'd expect it to. Here's data[:10] from a working URL:
b'\n<!DOCTYPE'
Here's data[:10] from the 538 site:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
What's up?
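One observation on the bytes above: the first two, 0x1f 0x8b, match the gzip magic number, so the body may be gzip-compressed rather than in some other text encoding. Below is a minimal sketch under that assumption. The helper name decode_body is hypothetical, and the sample bytes are built locally so the snippet runs without touching the network.

```python
import gzip

# First two bytes of any gzip stream
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: gunzip the body if it looks gzip-compressed, then decode it."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline demonstration: a locally compressed body stands in for the 538 response
sample = gzip.compress("<!DOCTYPE html>".encode("utf-8"))
print(sample[:2])                      # same two leading bytes as in the question
print(decode_body(sample))             # compressed input is decompressed first
print(decode_body(b"\n<!DOCTYPE"))     # uncompressed input is decoded as before
```

If the assumption holds for the 538 page, replacing the last line of the original code with dataString = decode_body(data) should give the expected HTML. Uncompressed pages take the same path as before.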