
I am fairly new to Python, so kindly excuse any missing information. As part of the curriculum, I was introduced to Python for quants/finance. I am studying multiprocessing and trying to understand it better. I tried modifying the problem given, and now I am stuck.

Problem:

I have a function which gives me ticks in OHLC format,

{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000}

every minute. I wish to do the following calculations concurrently, and preferably append/insert the results into the same list (a plain sequential sketch of the first two follows the list):

  • Find the moving average of the last 5 close prices
  • Find the median of the last 5 open prices
  • Save the tick data to a database.
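
For reference, here is a plain sequential sketch of those two rolling calculations before any multiprocessing; the on_tick helper and the use of collections.deque and statistics.median are just illustrative choices:

from collections import deque
from statistics import median

window = deque(maxlen=5)   # keeps only the last 5 ticks

def on_tick(tick):
    window.append(tick)
    if len(window) == 5:
        tick['MA_5_close'] = sum(t['close'] for t in window) / 5
        tick['Median_5_open'] = median(t['open'] for t in window)
    return tick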

So the expected data is likely to be:

{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000,'MA_5_close':300.25,'Median_5_open':300.50}

Assuming that the data is going to a db, it's fairly easy to write a simple dbinsert routine; I don't see that as a great challenge, and I can spawn a process to execute an insert statement every minute.

How do I sync 3 different functions/processes (a function to insert into the db, a function to calculate the average, a function to calculate the median), while holding 5 ticks in memory to calculate the 5-period simple moving average, and push the results back into the dict/list?

This requirement is what challenges me in writing the multiprocessing routine. Can someone guide me? I don't want to use a pandas DataFrame.

==== REVISION/UPDATE ====

The reason why I don't want any solution based on pandas/NumPy is that my objective is to understand the basics, not the nuances of a new library. Please don't mistake my need for understanding for arrogance or an unwillingness to be open to suggestions.

The advantage of having

p1 = Process(target=Median, args=(sourcelist,))
p2 = Process(target=Average, args=(sourcelist,))
p3 = Process(target=insertdb, args=(updatedlist,))

would help me understand the possibility of scaling processes based on the number of functions/algo components. But how should I make sure p1 and p2 are in sync, while p3 executes only after both p1 and p2 have finished? A sketch of the ordering I mean is shown below.
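
For illustration, one way to express that ordering with plain multiprocessing primitives (assuming Median, Average, insertdb, sourcelist and updatedlist are all defined elsewhere) would be:

from multiprocessing import Process

p1 = Process(target=Median, args=(sourcelist,))
p2 = Process(target=Average, args=(sourcelist,))
p1.start()
p2.start()
p1.join()    # wait for the median worker
p2.join()    # wait for the average worker

p3 = Process(target=insertdb, args=(updatedlist,))
p3.start()   # starts only after p1 and p2 have finished
p3.join()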

Suresh
  • Hello! Also, for future questions, I've found this helpful: https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks – jmunsch Sep 08 '17 at 02:50
  • To share a dictionary, list, array, etc, you need to use a [`multiprocessing.managers.SyncManager`](https://docs.python.org/2/library/multiprocessing.html#multiprocessing.managers.SyncManager) (which is a subclass of [`multiprocessing.managers.BaseManager`](https://docs.python.org/2/library/multiprocessing.html#multiprocessing.managers.BaseManager)) and the proxies for those types it supports. That said, performance is usually better using `multiprocessing.Queue` (which doesn't need a `Manager`). – martineau Sep 08 '17 at 02:50
  • Martineau, you are right regarding the queue; it's likely to be faster, but it may not always be the most suitable. Let's assume I have a queue_db and a dbinsert function to read from it and update the db; that would be OK. But what would happen if, in the future, I intend to distribute the workload of different algo results onto the same list, since that is the instance of data for that minute? Then I have 60 seconds to run possibly 20 algos and update the results back into the same list before proceeding to the next update. – Suresh Sep 08 '17 at 02:57

2 Answers

0

Here is an example of how to use multiprocessing:

from multiprocessing import Pool, cpu_count
from functools import partial
from statistics import median

def db_func(ma, med):
    # placeholder for a real database insert
    print('saving:', ma, med)

def backtest_strat(d, db_func):
    a = d.get('avg')
    m = d.get('median')
    db_func(sum(a) / len(a), median(m))

if __name__ == '__main__':
    with Pool(cpu_count()) as p:
        bs = partial(backtest_strat, db_func=db_func)
        print(p.map(bs, [{'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}]))

Note that this will not speed up anything unless there are a lot of slices.

So, for the speed-up part:

def get_slices(data):
    # yield one work item (the two 5-value windows) per incoming slice of data
    for chunk in data:
        yield {'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}

p.map(bs, get_slices(data))

From what I understand, multiprocessing works by message passing via pickles, so when pool.map is called each worker should have access to all three things: the two arrays and the db_func function. There are of course other ways to go about it, but hopefully this shows one of them.
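
To make the pickling point concrete, here is a small stand-alone sketch (the scale function is only an illustration): a partial of a module-level function can be shipped to the pool workers, while a lambda cannot be pickled.

from multiprocessing import Pool
from functools import partial

def scale(x, factor):
    return x * factor

if __name__ == '__main__':
    with Pool(2) as p:
        # a partial of a module-level function pickles fine
        print(p.map(partial(scale, factor=10), [1, 2, 3]))
        # a lambda would raise a PicklingError here:
        # p.map(lambda x: x * 10, [1, 2, 3])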

jmunsch
  • Thanks for the quick response, but this will not help. Oops, I am new here, so I'm having trouble responding; kindly bear with me. – Suresh Sep 08 '17 at 02:50
  • The median or average is just an example; it could be a much more complicated function... so basically a pool of processes wouldn't be ideal, as some functions will be faster than the rest. – Suresh Sep 08 '17 at 02:54
  • Kindly advise: what if the functions for median and average are separate? How should I synchronize them and maintain integrity? – Suresh Sep 08 '17 at 03:38
  • Hi jmunsch, sorry, I was away on personal time. I will try to run your code and understand how it works. I have just started to read about partials; I will keep you posted on my improved level of understanding :) at the end of this exercise. Thanks. – Suresh Sep 10 '17 at 10:50
-1

Question: How should I make sure p1 and p2 are in sync, while p3 executes only after p1 and p2?

If you sync all processes, computing one task (p1, p2, p3) can't be faster than the slowest process; in the meantime the other processes run idle.

This is called the "Producer-Consumer problem".
A solution using a Queue serializes all data, so no explicit synchronization is required.

# Process-1
def Producer(task_queue):
    task_queue.put(data)

# Process-2
def Consumer(task_queue):
    data = task_queue.get()
    # process data
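
A minimal way to wire this up end to end (a sketch; the None sentinel and the example payloads are only for illustration):

import multiprocessing as mp

def producer(task_queue):
    for data in range(5):        # stand-in for the incoming ticks
        task_queue.put(data)
    task_queue.put(None)         # sentinel: no more data

def consumer(task_queue):
    while True:
        data = task_queue.get()
        if data is None:         # sentinel received, stop
            break
        print('processing', data)

if __name__ == '__main__':
    q = mp.Queue()
    p1 = mp.Process(target=producer, args=(q,))
    p2 = mp.Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()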

You want multiple consumer processes, plus one consumer process that gathers all the results.
You don't want to use a Queue, but sync primitives instead.
The following example lets all processes run independently.
Only the Result process waits until it is notified.

This example uses an unlimited task buffer, tasks = mp.Manager().list().
Its size could be minimized if list entries for finished tasks were reused.
If you have some very fast algos, combine several of them into one process.

import multiprocessing as mp
import time
from random import randrange

# Base class for all WORKERS
class Worker(mp.Process):
    tasks = mp.Manager().list()           # shared task buffer
    task_ready = mp.Condition()           # signals Result that a task is complete
    lock = mp.Lock()                      # protects read/modify/write of a task entry
    parties = mp.Manager().Value(int, 0)  # number of compute workers per task
    feed_done = mp.Manager().Value(bool, False)  # set by main when no more data will arrive
    workers = []

    @classmethod
    def start(cls, workers):
        cls.workers = workers
        # Every worker except Result contributes one result per task
        cls.parties.value = sum(1 for w in workers if not isinstance(w, Result))
        for w in workers:
            mp.Process.start(w)

    @classmethod
    def join(cls):
        # Wait until all Data processed
        for w in cls.workers:
            mp.Process.join(w)

    def get_task(self):
        # Find the first pending task this WORKER has not contributed to yet
        for i, task in enumerate(Worker.tasks):
            if task is None:
                continue
            if self.__class__.__name__ not in task['result']:
                return (i, task['range'])
        return (None, None)

    # Main Process Loop
    def run(self):
        while True:
            # Get a Task for this WORKER
            idx, _range = self.get_task()
            if idx is None:
                if Worker.feed_done.value:
                    break            # no more data will arrive, terminate
                time.sleep(0.1)      # nothing pending yet, poll again
                continue

            # Compute with self Method this _range
            result = self.compute(_range)

            # Update Worker.tasks
            with Worker.lock:
                task = Worker.tasks[idx]
                task['result'][self.__class__.__name__] = result
                parties = len(task['result'])
                Worker.tasks[idx] = task

            # If Last, notify Process Result
            if parties == Worker.parties.value:
                with Worker.task_ready:
                    Worker.task_ready.notify()

class Result(Worker):
    # Main Process Loop
    def run(self):
        while True:
            with Worker.task_ready:
                Worker.task_ready.wait(timeout=1)

            # Get (idx, _range) of a pending Task from the tasks List
            idx, _range = self.get_task()
            if idx is None:
                if Worker.feed_done.value:
                    break
                continue
            if len(Worker.tasks[idx]['result']) < Worker.parties.value:
                continue             # that task is not complete yet, keep waiting

            # process Task Results
            print('Task', _range, '->', Worker.tasks[idx]['result'])

            # Mark this tasks List Entry as done for reuse
            Worker.tasks[idx] = None

class Average(Worker):
    def compute(self, _range):
        # Simple average of the 'close' values in this slice of DATA
        values = [rec['close'] for rec in DATA[_range[0]:_range[1]]]
        return sum(values) / len(values)

class Median(Worker):
    def compute(self, _range):
        # Median of the 'open' values in this slice of DATA
        values = sorted(rec['open'] for rec in DATA[_range[0]:_range[1]])
        mid = len(values) // 2
        return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

if __name__ == '__main__':
    # NOTE: relies on the 'fork' start method (the Linux default), so the child
    # processes inherit DATA and the Worker class attributes.
    DATA = mp.Manager().list()
    WORKERS = [Result(), Average(), Median()]
    Worker.start(WORKERS)

    # Example creates a Task every 5 Records
    for i in range(1, 16):
        DATA.append({'id': i, 'open': 300 + randrange(0, 5), 'close': 300 + randrange(-5, 5)})
        if i % 5 == 0:
            Worker.tasks.append({'range': (i - 5, i), 'result': {}})

    Worker.feed_done.value = True
    Worker.join()

Tested with Python: 3.4.2

stovfl