I have a file descriptor, and I would like to read from it with multiple tasks. Each read() request on the fd is going to return a full, independent packet of data (as long as data is available).

My naive implementation was to have each worker run the following loop:

async def work_loop(fd):
    while True:
        await trio.hazmat.wait_readable(fd)
        buf = os.read(fd, BUFSIZE)
        if not buf:
            break
        await do_work(buf)

Unfortunately, this does not work because trio raises ResourceBusyError if multiple tasks are blocking on the same fd. So my next iteration was to write a custom wait function:

async def work_loop(fd):
    while True:
        await my_wait_readable(fd)
        buf = os.read(fd, BUFSIZE)
        if not buf:
            break
        await do_work(buf)

where

read_queue = trio.hazmat.ParkingLot()

async def my_wait_readable(fd, name=None):
    if name is None:
        name = trio.hazmat.current_task().name
    while True:
        try:
            log.debug('%s: Waiting for fd to become readable...', name)
            await trio.hazmat.wait_readable(fd)
        except trio.ResourceBusyError:
            log.debug('%s: Resource busy, parking in read queue.', name)
            await read_queue.park()
            continue
        log.debug('%s: fd readable, unparking next task.', name)
        read_queue.unpark()
        break

However, in tests I get log messages like these:

2018-09-18 13:09:17.219 pyfuse3-worker-37: Waiting for fd to become readable...
2018-09-18 13:09:17.219 pyfuse3-worker-47: Waiting for fd to become readable...
2018-09-18 13:09:17.220 pyfuse3-worker-53: Waiting for fd to become readable...
2018-09-18 13:09:17.220 pyfuse3-worker-51: fd readable, unparking next task.
2018-09-18 13:09:17.220 pyfuse3-worker-51: doing work
2018-09-18 13:09:17.221 pyfuse3-worker-47: Resource busy, parking in read queue.
2018-09-18 13:09:17.221 pyfuse3-worker-37: Resource busy, parking in read queue.
2018-09-18 13:09:17.221 pyfuse3-worker-53: Resource busy, parking in read queue.

In other words:

  1. All tasks enter trio.hazmat.wait_readable
  2. One task returns successfully and tries to unpark the next task (but there is none)
  3. The other tasks receive ResourceBusyError and park themselves
  4. Nothing happens anymore, since all workers are parked

What's the proper way to solve this problem?

Nikratio
  • Rethink your approach. Read [SO:about-using-multiprocessing-to-read-file](https://stackoverflow.com/questions/46741567/about-using-multiprocessing-to-read-file) and [read-large-file-in-parallel](https://stackoverflow.com/questions/18104481/read-large-file-in-parallel) – stovfl Sep 18 '18 at 13:11

1 Answer

Multiple readers from the same fd don't make sense, using Trio (or not) doesn't change that basic fact. Why are you trying to do that in the first place?

If for some reason you really do need multiple parallel tasks to post-process your data, use one reader task to add the data to a queue and let your processing tasks get their data from that.
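For illustration, here is a minimal sketch of that single-reader pattern. To keep it self-contained and runnable it uses the stdlib asyncio and a plain asyncio.Queue rather than trio (in trio you would spawn the tasks in a nursery and use a memory channel instead), and the PACKETS list is a hypothetical stand-in for successive os.read(fd, BUFSIZE) results:

```python
import asyncio

# Hypothetical stand-in for the packets that os.read(fd, BUFSIZE)
# would return, one full packet per read.
PACKETS = [b"pkt-%d" % i for i in range(6)]

async def reader(queue, n_workers):
    # The single task that touches the data source.
    for pkt in PACKETS:
        await queue.put(pkt)
    for _ in range(n_workers):
        # One sentinel per worker so that every worker shuts down.
        await queue.put(None)

async def worker(name, queue, results):
    while True:
        pkt = await queue.get()
        if pkt is None:  # sentinel: the reader hit EOF
            break
        results.append((name, pkt))  # stand-in for do_work(pkt)

async def main(n_workers=3):
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(
        reader(queue, n_workers),
        *(worker("w%d" % i, queue, results) for i in range(n_workers)),
    )
    return results
```

Only the reader ever sees the data source, so there is nothing to race on; the workers just contend on the queue, which is built for exactly that.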

Alternately, you could use a lock:

read_lock = trio.Lock()

async def work_loop(fd):
    while True:
        async with read_lock:
            await trio.hazmat.wait_readable(fd)
            buf = os.read(fd, BUFSIZE)
        if not buf:
            break
        await do_work(buf)
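To see why holding the lock across both the wait and the read fixes the race, here is a self-contained sketch of the same pattern against a real pipe. It uses stdlib asyncio so it can run on its own; the wait_readable helper is only a rough analog of trio.hazmat.wait_readable built on loop.add_reader, not trio's API:

```python
import asyncio
import os

async def wait_readable(fd):
    # Rough asyncio analog of trio.hazmat.wait_readable().
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def on_readable():
        if not fut.done():
            fut.set_result(None)

    loop.add_reader(fd, on_readable)
    try:
        await fut
    finally:
        loop.remove_reader(fd)

async def work_loop(name, fd, lock, results):
    while True:
        async with lock:
            # Waiting for readability and reading happen atomically
            # with respect to the other workers.
            await wait_readable(fd)
            buf = os.read(fd, 65536)
        if not buf:  # EOF: the write side was closed
            break
        results.append((name, buf))

async def main():
    r, w = os.pipe()
    os.set_blocking(r, False)
    os.write(w, b"packet")
    os.close(w)  # workers see EOF after draining the pipe
    lock = asyncio.Lock()
    results = []
    await asyncio.gather(
        *(work_loop("w%d" % i, r, lock, results) for i in range(3))
    )
    os.close(r)
    return results
```

Exactly one worker reads the packet; the others wake up, find the pipe at EOF, and exit cleanly instead of racing each other for the same fd.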
Matthias Urlichs
  • I don't follow. To me, turning this argument around sounds just as plausible: "Using one task just to read the data and add it to a queue with multiple readers doesn't make sense. If you really need multiple parallel tasks, let them each read from the fd directly". Could you elaborate on *why* my approach doesn't make sense? It certainly seems simpler, as it avoids the extra task and the extra queue. – Nikratio Sep 19 '18 at 16:46
  • It doesn't make sense because (a) reading from the fd is not the time critical part so there's no reason multiple tasks should do it, (b) waiting-for-readability and then reading is not atomic, thus letting more than one task do it requires locking. In fact, I'll amend my answer to add a lock to your code, which should also fix the problem. – Matthias Urlichs Sep 20 '18 at 05:37
  • The reason to read from multiple threads is because the threads are already there anyway - why would I want to introduce yet another one and an extra queue? The solution with the lock works nicely though, thanks! – Nikratio Sep 20 '18 at 09:09
  • Another task (not thread) and a queue isn't any more or less expensive than a lock, so whatever fits your code structure best. Anyway, you might want to use a queue if you have a read backlog otherwise and want to keep more data in the pipeline – a Unix pipe only buffers one page. Or if your data has some structure that needs to be reassembled. Or if the data connection might break and needs to be re-established, which is often easier if you have a task that you can simply restart. – Matthias Urlichs Sep 20 '18 at 13:23