1

I am writing a grid searching utility and am trying to use multiprocessing to speed up calculation. I have an objective function which interacts with a large class which I cannot pickle due to memory constraints (I can only pickle relevant attributes of the class).

import pickle
from multiprocessing import Pool


class TestClass:
    def __init__(self):
        self.param = 10

    def __getstate__(self):
        raise RuntimeError("don't you dare pickle me!")

    def __setstate__(self, state):
        raise RuntimeError("don't you dare pickle me!")

    def loss(self, ext_param):
        return self.param*ext_param


if __name__ == '__main__':
    test_instance = TestClass()

    def objective_function(param):
        return test_instance.loss(param)

    with Pool(4) as p:
        result = p.map(objective_function, range(20))
    print(result)

In the toy example above, I was expecting that, during pickling of objective_function, test_instance would also have to be pickled, thus throwing a RuntimeError (since __getstate__ raises). However this does not happen and the code runs smoothly.

So my question is - what is being pickled here exactly? And if test_instance is not pickled, then how is it reconstructed on individual processes?

Raven

2 Answers

0

On Windows + Python 3.8, I could not run the original code, which defines test_instance and objective_function as locals inside the main block; it fails with the error below:

    AttributeError: Can't get attribute 'objective_function' on <module '__mp_main__' from 'xxx.py'>

After moving the definition of objective_function and the initialization of test_instance into global scope, it works as you mentioned. From this, it seems that the global variables are initialized again in each worker process rather than pickled/unpickled.

Finally, I changed your code as below, and it triggered the error that you expected:

    test_instance1 = TestClass()
    test_instance2 = TestClass()
    with Pool(4) as p:
        result = p.map(objective_function, [test_instance1, test_instance2])
    print(result)

So the parameters passed to p.map really are pickled and unpickled.
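You can see the same thing without a Pool at all: calling pickle.dumps directly on an instance of the guarded class performs the same serialization step that p.map applies to its arguments (a minimal sketch reusing the TestClass from the question):

```python
import pickle


class TestClass:
    def __init__(self):
        self.param = 10

    def __getstate__(self):
        raise RuntimeError("don't you dare pickle me!")


try:
    # the same serialization p.map performs on each argument
    pickle.dumps(TestClass())
except RuntimeError as e:
    print(f"pickling blocked: {e}")
```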

Wilson.F
0

Ok, with Wilson's help and some further digging, I've managed to answer my own question. I'll insert the modified code from above to help with explanation:

import pickle
from multiprocessing import Pool, current_process


class TestClass:
    def __init__(self):
        self.param = 0

    def __getstate__(self):
        raise RuntimeError("don't you dare pickle me!")

    def __setstate__(self, state):
        raise RuntimeError("don't you dare pickle me!")

    def loss(self, ext_param):
        self.param += 1
        print(f"{current_process().pid}: {hex(id(self))}:  {self.param}: {ext_param} ")
        return f"{self.param}_{ext_param}"


def objective_function(param):
    return test_instance.loss(param)

if __name__ == '__main__':

    test_instance = TestClass()
    print(hex(id(test_instance)))
    print('objective_function' in globals())  # this returns True on my MacOS+python3.7

    with Pool(2) as p:
        result = p.map(objective_function, range(6))

    print(result)
    print(test_instance.param)

# ---- RUN RESULTS BELOW ----
# 0x7f987b955e48
# True
# 10484: 0x7f987b955e48:  1: 0 
# 10485: 0x7f987b955e48:  1: 1 
# 10484: 0x7f987b955e48:  2: 2 
# 10485: 0x7f987b955e48:  2: 3 
# 10484: 0x7f987b955e48:  3: 4 
# 10485: 0x7f987b955e48:  3: 5 
# ['1_0', '1_1', '2_2', '2_3', '3_4', '3_5']
# 0

As Wilson correctly hinted, the only things that get pickled during p.map are the parameters themselves, not the objective function. The function is not re-initialized either: it is copied, along with test_instance, during the os.fork() call that happens somewhere in the Pool initialization (on platforms where fork is the start method). You can see that even though the test_instance.param values inside each process are independent of one another, they all report the same virtual address as the original instance before the fork, because each child gets a copy-on-write view of the parent's memory (an example of different processes sharing the same virtual address can be seen here).

As for the initial question, I believe the only way to properly solve this issue is to place the necessary parameters in shared memory, or use a memory manager.
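A sketch of the shared-memory route (not from the original post; shared_param stands in for the one picklable attribute of the large class): put the attribute into a multiprocessing.Value and hand it to the workers through the Pool initializer, so it is shared under any start method:

```python
from multiprocessing import Pool, Value


def init_worker(shared):
    # make the shared Value visible inside each worker process
    global shared_param
    shared_param = shared


def objective_function(ext_param):
    # every worker reads the same underlying shared memory
    return shared_param.value * ext_param


if __name__ == '__main__':
    shared_param = Value('i', 10)  # 'i' = C int, lives in shared memory
    with Pool(2, initializer=init_worker, initargs=(shared_param,)) as p:
        result = p.map(objective_function, range(6))
    print(result)  # [0, 10, 20, 30, 40, 50]
```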

Raven