
Say I had an `io.BytesIO()` sitting on a thread that I wanted to write a response to:

import io
import requests

f = io.BytesIO()
with requests.Session() as s:
    r = s.get(url, stream=True)
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)

Now this is not going to the hard disk but rather staying in memory (I've got plenty of it for my purposes), so I don't have to worry about the disk head being a bottleneck. I know from the docs and this SO post by Alex Martelli that the GIL is released for blocking I/O (file reads/writes), but I wonder: does the GIL just release on `f.write()` and then get reacquired on the `__next__()` call of the loop?

So what I'd end up with is a bunch of rapid GIL acquisitions and releases. Obviously I would have to time this to determine anything worth noting, but in general, does writing to in-memory file objects in a multithreaded web scraper allow the GIL to be bypassed?

If not, I'll just handle the large responses whole, dump them into a queue, and process them on `__main__`.
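For concreteness, here's a minimal sketch of that fallback. The `worker` function and the byte payloads are made-up stand-ins for the real `s.get(url).content` download; only the queue hand-off pattern is the point:

```python
import io
import queue
import threading

results = queue.Queue()

def worker(url, payload):
    # Stand-in for the real download: in practice this would be
    # payload = s.get(url).content (hypothetical, for illustration).
    buf = io.BytesIO(payload)
    results.put((url, buf))

# Hypothetical payloads standing in for large HTTP responses
jobs = [("http://example.com/a", b"a" * 1024),
        ("http://example.com/b", b"b" * 2048)]
threads = [threading.Thread(target=worker, args=j) for j in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()

# __main__ drains the queue and does the parsing / disk saving
while not results.empty():
    url, buf = results.get()
    print(url, len(buf.getvalue()))
```

`queue.Queue` is thread-safe, so the workers can `put` concurrently while the main thread consumes without extra locking.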

pstatix
    How long does that `f.write()` take? I suspect for writing 1 KiB it's below a microsecond, mostly the function call overhead. I'd understand if you wrote a huge amount, hundreds of megs; maybe at that time something else could be scheduled. – 9000 May 01 '18 at 21:24
  • @9000 Doesn't take long, just curious about the inner workings. I'm chunking this to be kind to the network (which has low bandwidth) and since a `session.get()` method without `stream` downloads the entire response, when that response is >1GB I want a clean manner to handle it. Figured storing in memory files then passing those back to a central thread (`__main__`) for parsing before harddisk saving would be wise. – pstatix May 01 '18 at 21:35
  • The GIL essentially goes by CPython bytecode. So the GIL is released when it gets to the C function underneath `BytesIO.write`. When that C function returns, it's reacquired to execute the next bytecode in the calling function. It may continue to be held all the way through the next `__next__`, but that's not guaranteed. – abarnert May 01 '18 at 21:36
  • If you want to see the bytecode for your own code, `dis.dis` will show you. After the `CALL_FUNCTION`, most likely the only other things that happen are a `POP_TOP` to ignore the return value and a `JUMP_ABSOLUTE` back to the `FOR_ITER`. Then, inside that `FOR_ITER` it's doing the special method lookup and call to `__next__` on your iterator in C, and the next bytecode where it could get released is the start of that `iter_content.__next__` (or, if `iter_content` is a generator function, the next instruction after the `YIELD_VALUE`). – abarnert May 01 '18 at 21:40
  • @abarnert So when the program gets the C function under `__next__()`, does it not execute the C function under `BytesIO.write()` and then return back to the C function that was under `__next__()`? Thus the `for` could completely bypass the GIL. I would think this could be the case, and the code within a loop would be the real culprit subject to GIL-ness. – pstatix May 01 '18 at 21:40
  • By "the C function that was under `__next__`, do you mean `generator.__next__`? Are you asking whether that function can release the GIL, before resuming the generator? (I'm assuming the relevant `requests` object uses a generator function to implement `iter_content`, but you'd want to check that, of course…) – abarnert May 01 '18 at 21:44
  • @abarnert I've got near zero C experience (mostly C++), so the implementation details may be a bit fuzzy for me. I mean whatever function in C calls `generator.__next__` (yes, `iter_content` is a generator). What I am getting as is that if the GIL gets to the C that is under the `write()`, surely its already at the C that is under the `__next__` meaning no more bytecode until exiting the `for` right? Shouldn't the GIL be able to release during a `__next__()`? – pstatix May 01 '18 at 21:47
  • Calling `write` is not the last bytecode in the loop. It still takes at least two more bytecodes (the `POP_TOP` and `JUMP_ABSOLUTE` mentioned above) to finish the loop. And it also needs to be holding the GIL to start processing the `FOR_ITER`. Inside the part of the ceval loop that handles `FOR_ITER`, it calls a function that does the special method lookup on the iterator, sees that it's a C function, and calls it, and that's how you get into `generator.__next__`. Each of those three bytecodes is a pass through the `main_loop:` in `ceval`. – abarnert May 01 '18 at 21:53
  • So, there are three chances to release the GIL because of a `gil_drop_request` from another thread, but, oversimplifying, it only does that 1/N times through the loop or when certain special cases are triggered. So it's _likely_ that it will hold the GIL the entire time from acquiring it after the `write` until getting into `generator.__next__`, but it's not guaranteed. – abarnert May 01 '18 at 21:55
  • But anyway, if you think the `BytesIO.write` is the slow part, why are you asking whether the GIL gets released again between the `write` and the `__next__`? – abarnert May 01 '18 at 21:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/170193/discussion-between-pstatix-and-abarnert). – pstatix May 02 '18 at 00:42
  • @abarnert I don't think any part is slow. I was asking in terms of design: since I/O doesn't hold the GIL, whether choosing chunked I/O (as shown) vs. waiting for the entire response body to download and then calling `with open() as f: f.write(content)` (which would not require the GIL according to the docs) would be smarter or just working harder. – pstatix May 02 '18 at 00:46
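The bytecode shape abarnert describes in the comments above can be inspected directly with the `dis` module. The `consume` function below is a reduced stand-in for the real download loop (exact opcode names vary across CPython versions, e.g. `CALL_FUNCTION` became `CALL` in 3.11):

```python
import dis
import io

def consume(chunks):
    f = io.BytesIO()
    for chunk in chunks:   # FOR_ITER drives the iterator's __next__
        f.write(chunk)     # call, then POP_TOP discards the return value

# Print the bytecode of the loop
dis.dis(consume)

# The loop body is: call f.write, POP_TOP the result, jump back to FOR_ITER.
names = [i.opname for i in dis.get_instructions(consume)]
assert "FOR_ITER" in names and "POP_TOP" in names
```

Each of those bytecodes is one pass through the `ceval` main loop, and each pass is a point where a `gil_drop_request` from another thread could be honored.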

1 Answer


From what I can see in the BytesIO type's source code, the GIL is not released during a call to BytesIO.write, since it's just doing a quick memory copy. It's only for system calls that may block that it makes sense for the GIL to be released.

There probably is such a syscall in the __next__ method of the r.iter_content generator (when data is read from a socket), but there's none on the writing side.

But I think your question reflects an incorrect understanding of what it means for a builtin function to release the GIL when doing a blocking operation. It will release the GIL just before it makes the potentially blocking syscall. But it will reacquire the GIL before it returns to Python code. So it doesn't matter how many such GIL-releasing operations you have in a loop; all the Python code involved will be run with the GIL held. The GIL is never released by one operation and reclaimed by a different one. It's both released and reclaimed for each operation, as a single self-contained step.
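You can see this "release just before the syscall, reacquire before returning" behavior with a builtin that is documented to drop the GIL. Here `time.sleep` stands in for a blocking I/O call (a simplified illustration, not your scraper): the four blocking waits overlap across threads, while every line of surrounding Python still runs with the GIL held.

```python
import threading
import time

def blocking_op():
    # time.sleep releases the GIL around its underlying wait,
    # then reacquires it before returning to Python code.
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=blocking_op) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The four 0.2 s waits overlap: wall time is ~0.2 s, not 0.8 s,
# because each sleep drops the GIL for its whole blocking span.
print(f"elapsed: {elapsed:.2f}s")
```

A pure `BytesIO.write` loop would show no such overlap, since the write never drops the GIL.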

As an example, you can look at the C code that implements writing to a file descriptor. The macro Py_BEGIN_ALLOW_THREADS releases the GIL. A few lines later, Py_END_ALLOW_THREADS reacquires the GIL. No Python-level code runs in between those steps, only a few low-level C assignments regarding errno, and the write syscall that might block, waiting on the disk.

Blckknght
  • I think I need to try some C projects so I can discern just what is going on in that source code. Thanks for your further detail on GIL acquisition, very helpful. Based upon this, chunking a network I/O response to store in memory doesn't seem efficient. Since only a single thread can execute the `for chunk in r.iter_content()` at any time, if 4 threads were live with large files, there would be lots of context switching right? – pstatix May 02 '18 at 13:44
  • Only one thread could be doing the `f.write(chunk)` part of the loop, but if the most time spent is waiting on the network to download the next chunk in the generator's `__next__` method (something that will probably release the GIL), then you might still benefit from parallelism. I'd agree that the chunking part is not likely to improve performance, since there's more Python-level code that gets run that way. – Blckknght May 02 '18 at 22:18