3

I currently use the following code to decompress a gzipped response from urllib2:

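import gzip
import StringIO
import urllib2

# req is a urllib2.Request built earlier
opener = urllib2.build_opener()
response = opener.open(req)
data = response.read()
if response.headers.get('content-encoding', '') == 'gzip':
    # wrap the compressed bytes in a file-like object for GzipFile
    data = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()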

Does it handle deflated responses too, or do I need to write separate code to handle them?

jack
  • An HTTP server should not send a compressed response unless the client asks for it with the Accept-Encoding: header, so you shouldn't have to deal with either. – Knio Dec 08 '09 at 02:21
  • In this case, I have added req.add_header('Accept-Encoding', 'gzip,deflate') before the code above. However, if I don't specify the "Accept-Encoding" header, urllib2 sometimes returns binary data from a text/html URL, which cannot be printed on screen. So are you sure ALL HTTP servers won't send a compressed response without an "Accept-Encoding" header? – jack Dec 08 '09 at 05:24
  • urllib2 automatically adds Accept-Encoding: gzip, deflate when creating a default request object, so it's not the server's fault (no idea how to turn this off, though) – Alex Lehmann Sep 27 '10 at 15:51
  • If possible, you should remove `deflate` from `Accept-Encoding`. See my comments on this in an answer here: http://stackoverflow.com/questions/9170338/why-are-major-web-sites-using-gzip . If you must accept `deflate`, then you will need to try decoding both possible encodings, zlib and raw deflate. – Mark Adler Aug 05 '12 at 00:29

4 Answers

4

There is a better way outlined at:

The author explains how to decompress chunk by chunk, rather than all at once in memory. This is the preferred method when larger files are involved.
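
The linked article is not reproduced here, so the following is only a minimal sketch of the general chunk-by-chunk idea, written for Python 3 and using nothing but `zlib`. The helper names `iter_gunzip` and `read_in_chunks` are made up for this example, and an in-memory `BytesIO` stands in for the response object:

```python
import gzip
import io
import zlib


def read_in_chunks(fileobj, size=8192):
    """Yield successive blocks from a file-like object (e.g. an HTTP response)."""
    while True:
        block = fileobj.read(size)
        if not block:
            break
        yield block


def iter_gunzip(chunks):
    """Yield decompressed pieces from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        piece = decoder.decompress(chunk)
        if piece:
            yield piece
    tail = decoder.flush()  # anything still buffered inside the decoder
    if tail:
        yield tail


# Demo: in-memory data stands in for a real response (assumption).
original = b"hello gzip " * 1000
fake_response = io.BytesIO(gzip.compress(original))
result = b"".join(iter_gunzip(read_in_chunks(fake_response, 64)))
assert result == original
```

In real use you would write each piece to a file or parser as it arrives instead of joining them, so only one chunk needs to be held in memory at a time.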

Also found this helpful site for testing:

Gringo Suave
4

You can try

if response.headers.get('content-encoding', '') == 'deflate':
    html = zlib.decompress(response.read())

If that fails, here is another way, which I found in the requests source code:

if response.headers.get('content-encoding', '') == 'deflate':
    html = zlib.decompressobj(-zlib.MAX_WBITS).decompress(response.read())
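
The two attempts above can be folded into one helper. This is just a sketch: `decode_deflate` is a name invented for this example, and it assumes the whole body is already in memory as bytes.

```python
import zlib


def decode_deflate(body):
    """Decode a 'deflate' body that may be zlib-wrapped or raw deflate."""
    try:
        # Most servers send zlib-wrapped data (2-byte header + checksum).
        return zlib.decompress(body)
    except zlib.error:
        # Some send a raw deflate stream; negative wbits means "no header".
        return zlib.decompress(body, -zlib.MAX_WBITS)
```

Call it as `html = decode_deflate(response.read())` when the Content-Encoding is `deflate`. If the second attempt also fails, the `zlib.error` propagates to the caller.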
muzuiget
1

To answer the comment above, the HTTP spec (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.3) says:

If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.

I take that to mean it should use identity. I've never seen a server that doesn't.
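
If you would rather not rely on that default, you can ask for an uncompressed response explicitly. A small sketch in Python 3 (`urllib.request` is the successor of urllib2); the URL is a placeholder and nothing is actually sent:

```python
from urllib.request import Request

# Hypothetical URL, used only for illustration; no request is sent here.
req = Request("http://example.com/", headers={"Accept-Encoding": "identity"})

# Request stores header names via str.capitalize(), hence the lowercase "e".
print(req.get_header("Accept-encoding"))
```

With urllib2 the equivalent would be `req.add_header('Accept-Encoding', 'identity')`. Whether a given server honors it is still up to the server.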

Knio
  • It will work only if the server says "deflate" and delivers "zlib". "zlib" != "deflate". See the SO thread that Alex Martelli quotes. – John Machin Dec 08 '09 at 06:39
  • I did some more testing and you're right, it doesn't work on all servers. However there is no such thing as a "zlib" encoding, and deflate *is* the zlib algorithm, it just needs a proper header or something – Knio Dec 08 '09 at 07:50
1

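You can see how urllib3 handles this. For deflate it first tries zlib-wrapped data and falls back to raw deflate if that raises an error: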

import zlib

binary_type = bytes  # urllib3 takes this name from six; on Python 2 it is str

class DeflateDecoder(object):

    def __init__(self):
        self._first_try = True
        self._data = binary_type()
        self._obj = zlib.decompressobj()

    def __getattr__(self, name):
        return getattr(self._obj, name)

    def decompress(self, data):
        if not data:
            return data

        if not self._first_try:
            return self._obj.decompress(data)

        self._data += data
        try:
            return self._obj.decompress(data)
        except zlib.error:
            self._first_try = False
            self._obj = zlib.decompressobj(-zlib.MAX_WBITS)
            try:
                return self.decompress(self._data)
            finally:
                self._data = None


class GzipDecoder(object):

    def __init__(self):
        self._obj = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def __getattr__(self, name):
        return getattr(self._obj, name)

    def decompress(self, data):
        if not data:
            return data
        return self._obj.decompress(data)
leiqin