I'm trying to speed up some code that calls an api_caller(), which is a generator that you can iterate over to get results.

My synchronous code looks something like this:

def process_comment_tree(p):
    # time consuming breadth first search that makes another api call...
    return

def process_post(p):
    process_comment_tree(p)

def process_posts(kw):
    for p in api_caller(query=kw): #possibly 1000s of results
        process_post(p)
    
def process_kws(kws):
    for kw in kws:
        process_posts(kw)

process_kws(kws=['python', 'threads', 'music'])

When I run this code on a long list of kws, it takes around 18 minutes to complete.

When I use threads:

import concurrent.futures

kws = ['python', 'threads', 'music']
with concurrent.futures.ThreadPoolExecutor(max_workers=len(kws)) as pool:
    for result in pool.map(process_posts, kws):
        print(f'result: {result}')

the code completes in around 3 minutes.

Now, I'm trying to use Trio for the first time, but I'm having trouble.

import trio

async def process_comment_tree(p):
    # same as before...
    return

async def process_post(p):
    await process_comment_tree(p)

async def process_posts(kw):
    async with trio.open_nursery() as nursery:
        for p in r.api.search_submissions(query=kw):
            nursery.start_soon(process_post, p)
    
async def process_kws(kws):
    async with trio.open_nursery() as nursery:
        for kw in kws:
            nursery.start_soon(process_posts, kw)

trio.run(process_kws, ['python', 'threads', 'music'])

This still takes around 18 minutes to execute. Am I doing something wrong here, or is something like trio/async not appropriate for my problem setup?

theQman
  • This `r.api.search_submissions` seems to be a synchronous call, which blocks so you only have one running at a time. – lilydjwg Jun 27 '20 at 16:06
  • Yes, that seems to be the case. Is there no way around this without digging into the external api? Why is the threaded version so much faster? – theQman Jun 27 '20 at 17:29

1 Answer

Trio, and async libraries in general, work by switching to a different task while the current one is waiting for something external, like an API call. In your code example, it looks like you start a bunch of tasks, but each one waits on the external call without ever giving Trio a chance to switch, so they end up running one after another. I would recommend reading this part of the tutorial; it gives an idea of what that means: https://trio.readthedocs.io/en/stable/tutorial.html#task-switching-illustrated

Basically, your code has to call a function that will pass control back to the run loop so that it can switch to a different task.

If your api_caller generator makes calls to an external API, that's likely to be something you can replace with async calls. You'll need to use an async HTTP library, like HTTPX or hip.

On the other hand, if there's nothing in your code that has to wait for something external, then async won't help your code go faster.

  • `api_caller` is part of a third party library. Are you saying I would need to edit that code to get this to work? Why is the threaded version so much faster? – theQman Jun 27 '20 at 17:32
  • You can make it work with Trio using Trio's `trio.to_thread.run_sync` function: https://trio.readthedocs.io/en/stable/reference-core.html#threads-if-you-must The difference between Trio and threads is that threads switch without you asking them to. This can make it difficult to deal with race conditions (see https://stackoverflow.com/questions/4024056/threads-vs-async for other points). Essentially, in Trio, you have to tell Trio "I'm waiting for something, go ahead and run something else now," whereas with threads it's constantly switching back and forth anyway, so you don't have to tell it you're waiting. – Harrison Morgan Jun 27 '20 at 18:24