
I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1, 2, 3, 4]), which can be up to a thousand elements long.

This function can potentially be called tens of thousands of times in a row, and I want to know whether I can do anything to make it run faster.

graph_hash is a large dict that maps keys to sets of 1000 elements or fewer. I am iterating over one of those sets, mapping the values to names, and returning a list. u.get_name_from_id() queries an SQLite database.

Any thoughts on optimizing any part of this function?

def get_neighbors(pid):
    names = []
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
    return names

  • You could make it a generator, though that would just push the bottleneck to the calling function – OneCricketeer Oct 31 '16 at 03:40
  • Cache as much as you can: `get_neighbors` and `u.get_name_from_id` – salparadise Oct 31 '16 at 03:45
  • Can you construct a `WHERE x IN (all the items in the set)` clause and just make one DB query per `pid`? If you can, and `get_name_from_id` returns a list, use `names.extend(...)` – wwii Oct 31 '16 at 04:12

4 Answers


You're hitting the database sequentially here:

for p in graph_hash[pid]:
    names.append(u.get_name_from_id(p))

I would recommend doing it concurrently using threads. Something like this should get you started:

import threading
from queue import Queue  # on Python 2: from Queue import Queue

def load_stuff(queue, p):
    # each worker thread looks up one name and puts it on the shared queue
    queue.put(u.get_name_from_id(p))

def get_neighbors(pid):
    names = Queue()
    # we'll keep track of the threads with this list
    threads = []
    for p in graph_hash[pid]:
        thread = threading.Thread(target=load_stuff, args=(names, p))
        threads.append(thread)
        # start the thread
        thread.start()
    # wait for them all to finish before you return your Queue
    for thread in threads:
        thread.join()
    return names

You can turn the Queue back into a list with `[item for item in names.queue]` if needed.

The idea is that each database call blocks until it finishes, but you can run multiple SELECT statements against the database concurrently without them locking each other out. So you should use threads, or some other concurrency method, to avoid waiting unnecessarily.
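
For reference, here is a minimal sketch of the same idea using a thread pool (concurrent.futures, not part of the original answer) so the thread count stays bounded even for sets of ~1000 ids. It assumes u.get_name_from_id() is safe to call from worker threads; note that an sqlite3 connection can normally only be used from the thread that created it unless it was opened with check_same_thread=False.

from concurrent.futures import ThreadPoolExecutor

def get_neighbors(pid, max_workers=16):
    # pool.map preserves the iteration order of graph_hash[pid]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(u.get_name_from_id, graph_hash[pid]))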

rofls

Caching and multithreading are only going to get you so far; you should create a new method that retrieves multiple names from the database in bulk, using a single `SELECT ... WHERE id IN (...)` query under the hood instead of one query per id.

Something like `names = u.get_names_from_ids(graph_hash[pid])`.
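
As a rough sketch of what such a helper could look like (the table and column names here are assumptions, since the question doesn't show the schema):

import sqlite3

def get_names_from_ids(conn, ids):
    # one round trip: select every requested id at once (hypothetical "names" table)
    ids = list(ids)
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id, name FROM names WHERE id IN (%s)" % placeholders, ids
    ).fetchall()
    by_id = dict(rows)
    # keep the order of the input ids, skipping any that weren't found
    return [by_id[i] for i in ids if i in by_id]

Note that older SQLite builds limit a statement to 999 bound parameters, so sets of exactly 1000 ids may need to be split into two batches.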

Francisco C

I would recommend using a `deque` instead of a `list` if you are doing thousands of appends. So, `names` should be `names = deque()` (from the `collections` module).

Nurjan
  • That's not right: lists in Python are optimized for efficiently handling an `append()` operation. New lists are _not_ being created; the elements are just added at the end. – Óscar López Oct 31 '16 at 04:16
  • According to https://wiki.python.org/moin/TimeComplexity: `"Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move). If you need to add/remove at both ends, consider using a collections.deque instead."` – Nurjan Oct 31 '16 at 04:20
  • The key part here being "at both ends". That's not the case when appending to a list, we're adding elements at the most efficient end. (And anyway, a `deque` also has to grow if adding elements beyond its initial capacity) – Óscar López Oct 31 '16 at 04:22

A list comprehension is a start (similar to @cricket_007's generator suggestion), but you are limited by function calls:

def get_neighbors(pid):
    return [u.get_name_from_id(p) for p in graph_hash[pid]]

As @salparadise suggested, consider memoization to speed up get_name_from_id().
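
For example, a minimal memoization sketch using functools.lru_cache (assuming the name for a given id never changes, so cached results stay valid):

import functools

@functools.lru_cache(maxsize=None)
def name_from_id_cached(p):
    # only the first lookup for each id hits the SQLite database
    return u.get_name_from_id(p)

def get_neighbors(pid):
    return [name_from_id_cached(p) for p in graph_hash[pid]]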

pylang