
I'm looking for a way in Python (2.7) to do HTTP requests with 3 requirements:

  • timeout (for reliability)
  • content maximum size (for security)
  • connection pooling (for performance)

I've checked just about every Python HTTP library, but none of them meets all my requirements. For instance:

urllib2: good, but no pooling

import urllib2
import json

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

requests: no max size

import requests
import json

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and not safe to rely on anyway
content = r.raw.read(100000+1)
print content # argh, this is gzipped, so not the real size
print json.loads(content) # content is gzipped, so pretty useless
print r.json() # does not work anymore since raw.read was used

urllib3: never got the "read" method working, even with a 50 MB file ...
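
For reference, a sketch of how this could look with urllib3 directly (PoolManager gives the pooling, preload_content=False defers the body read, and decode_content=True asks read() for decompressed bytes); the read() call here is the part I could never get to behave:

import urllib3
import json

http = urllib3.PoolManager()  # one pool reused across requests

r = http.request('GET', 'https://github.com/timeline.json',
                 timeout=5.0, preload_content=False)
content = r.read(100000+1, decode_content=True)  # bounded, decompressed read
r.release_conn()

if len(content) > 100000:
    print 'too large'
else:
    print json.loads(content)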

httplib: httplib.HTTPConnection is not a pool (only one connection)

I can hardly believe that urllib2 is the best HTTP library I can use! So if anyone knows which library can do this, or how to use one of the libraries above ...

EDIT:

The best solution I found, thanks to Martijn Pieters (StringIO does not slow down even for huge files, whereas str concatenation slows things down a lot):

import requests
from StringIO import StringIO

maxsize = 100000

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
size = 0
ctt = StringIO()

for chunk in r.iter_content(2048):
    size += len(chunk)
    ctt.write(chunk)
    if size > maxsize:
        r.close()
        raise ValueError('Response too large')

content = ctt.getvalue()
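
For completeness, the same idea wrapped in a small helper (a sketch; requests pools connections per Session, which covers the pooling requirement; the function name and defaults are mine):

import json
import requests
from StringIO import StringIO

session = requests.Session()  # connections are pooled and reused per Session

def fetch_json_capped(url, maxsize=100000, timeout=5):
    # Stream the response and abort as soon as more than maxsize decoded bytes arrive.
    r = session.get(url, stream=True, timeout=timeout)
    size = 0
    buf = StringIO()
    for chunk in r.iter_content(2048):
        size += len(chunk)
        buf.write(chunk)
        if size > maxsize:
            r.close()
            raise ValueError('Response too large')
    return json.loads(buf.getvalue())

print fetch_json_capped('https://github.com/timeline.json')
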
Aurélien Lambert

1 Answer


You can do it with requests just fine; but you need to know that the raw object is part of the urllib3 guts, and make use of the extra arguments the HTTPResponse.read() call supports, which let you specify that you want to read decoded data:

import requests
import json
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

content = r.raw.read(100000+1, decode_content=True)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

Alternatively, you can set the decode_content flag on the raw object before reading:

import requests
import json
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

r.raw.decode_content = True
content = r.raw.read(100000+1)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

If you don't like reaching into urllib3 guts like that, use response.iter_content() to iterate over the decoded content in chunks; this uses the underlying HTTPResponse too (via the .stream() generator version):

import requests
import json

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

maxsize = 100000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')

print content
print json.loads(content)

There is a subtle difference here in how compressed data sizes are handled; r.raw.read(100000+1) will only ever read 100k bytes of compressed data; the uncompressed data is tested against your max size. The iter_content() method will read more uncompressed data in the rare case the compressed stream is larger than the uncompressed data.

Neither method allows r.json() to work, because the response._content attribute isn't set by these calls; you can set it manually, of course. But since the .raw.read() and .iter_content() calls already give you access to the content in question, there is really no need.
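
If you do want r.json() afterwards anyway, a minimal sketch that fills in the private _content attribute mentioned above by hand (this relies on requests internals, so treat it as fragile; content here is the capped body read in one of the snippets above):

# Sketch only: populate the private _content attribute so r.json() works.
# Relies on requests internals and may break between versions.
r._content = content
print r.json()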

Martijn Pieters
  • Thank you. I tried to compare which method works best (in particular, which one limits the real size, not the downloaded one): `urllib2` does not accept compression, `r.raw.read` compares the gzipped size, and `r.iter_content` compares the real size but really slows down the code (perhaps a stream would make it faster). – Aurélien Lambert May 07 '14 at 12:43
  • @AurélienLambert: how much `r.iter_content()` slows down the code depends entirely on the size of the chunks read; a small chunk size necessitates more loop iterations. And it operates on a stream *already*. – Martijn Pieters May 07 '14 at 12:47
  • The `content += chunk` slows it down due to Python str immutability. StringIO.StringIO solved it. – Aurélien Lambert May 07 '14 at 12:52
  • Yes, I contemplated using a list instead, then `''.join()` at the end, but `StringIO()` encapsulates that nicely. – Martijn Pieters May 07 '14 at 12:53
  • You can't use timeout for a stream. **Documentation: On streaming requests, the timeout only applies to the connection attempt** – Adrian B Jul 08 '14 at 21:12
  • @AdrianB: so? You *can* use `timeout` but it only applies to the connection attempt. – Martijn Pieters Jul 08 '14 at 21:18
  • @MartijnPieters Yes, and for me that is a big problem – Adrian B Jul 08 '14 at 21:44
  • @Adrian: did my answer lead you to believe that it did? I'm happy to clarify on that. You can wrap this in a signal handler that cuts off the request if it takes too long still, for example. – Martijn Pieters Jul 08 '14 at 22:22
  • @Adrian: for example: [Timeout function if it takes too long to finish](http://stackoverflow.com/q/2281850) – Martijn Pieters Jul 08 '14 at 22:27
  • 2
    For anyone trying this on Python3 note that you'll need `content = b''` +1 – zx81 Jul 23 '15 at 02:25
  • @AurélienLambert: beware of [gzip bombs](http://stackoverflow.com/q/13622706/4279) -- I don't know whether `decode_content=True` above makes the code susceptible. Unrelated: you can, if you wish, [read compressed data using `urllib2` if you read it into memory, as in your case](http://stackoverflow.com/a/3947241/4279). [Python 3 code allows streaming gzipped content](http://stackoverflow.com/a/26435241/4279). – jfs Sep 30 '15 at 01:57
  • @AdrianB: the only non-asyncio, portable [HTTP library that limits the total connection and read timeouts (that I know of) is `pycurl` (which has a horrendous API)](http://stackoverflow.com/q/9548869/4279). The alternative is to close the connection using `Timer()`; e.g., if `r` is a `urllib.request.urlopen()` response, then `Timer(timeout, r.fp.raw._sock.shutdown, [socket.SHUT_RDWR])` can enforce a total read timeout (if the various `.close()` methods were idempotent here, there would be a less ugly way to implement the timeout without reaching into the guts). – jfs Sep 30 '15 at 02:15
  • @J.F.Sebastian: `decode_content=True` allows for the exact same decompression handling as would be used for the `response.content` or `response.text` properties (loading the whole content as one binary or Unicode string). All decompression is handled in urllib3 in either case, no protection against a decompression bomb is included in that. – Martijn Pieters Sep 30 '15 at 07:44
  • If the response is gzipped, does it unpack the content block by block? – daisy May 09 '17 at 05:54
  • @daisy: yes, compressed content is decompressed as you stream. – Martijn Pieters May 09 '17 at 06:24
  • @MartijnPieters: Saving chunk content as bytes to `content = b''` will still eat up memory. The `StringIO` option is an interesting alternative, but requires an additional module import. Probably easier to add up the chunk length? So `size = 0` instead of `content = b''`, and then `size += len(chunk)` instead of `content += chunk` and then check `if size > maxsize`. Could also initially check `if int(r.headers.get('Content-Length')) > maxsize`, in which case you don't have to download chunks at all if Content-Length is actually set. – kregus Dec 04 '17 at 14:13
  • @kregus: `StringIO` will require the same amount of memory, it won't flush to disk. You are free to handle the chunks as they come in instead of storing them in memory; you could write them to disk, for example. Chunked responses do not always have a Content-Length header, one of the reasons this question was posted in the first place (*does not exist for this request, and not safe*). – Martijn Pieters Dec 04 '17 at 14:28
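
Following up on the last two comments, a sketch that combines the Content-Length precheck with streaming to disk so nothing large is held in memory (the filename and maxsize are illustrative; note that Content-Length, when present, counts bytes on the wire, which may be the compressed size):

import requests

maxsize = 100000

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

# Cheap precheck: chunked responses may not carry Content-Length at all.
cl = r.headers.get('Content-Length')
if cl is not None and int(cl) > maxsize:
    r.close()
    raise ValueError('Response too large')

# Count decoded bytes as they stream in and write them to disk
# instead of accumulating them in memory.
size = 0
with open('timeline.json', 'wb') as out:
    for chunk in r.iter_content(2048):
        size += len(chunk)
        if size > maxsize:
            r.close()
            raise ValueError('Response too large')
        out.write(chunk)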