
What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?

# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))

# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
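
For concreteness, here is roughly what I imagine the scatter version looking like, reusing Foo and df from above (process_row is a hypothetical helper, and mapping over individual rows is just to show the mechanics, not an efficient layout):

from dask.distributed import Client

client = Client()                                  # or Client(cluster) once the SGECluster exists
foo = Foo()                                        # still created locally in this variant
foo_future = client.scatter(foo, broadcast=True)   # ship one copy to every worker

def process_row(row, foo):                         # hypothetical per-row helper
    return foo.do(row)

futures = client.map(process_row, list(df.itertuples()), foo=foo_future)
results = client.gather(futures)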

I plan to use this with dask_jobqueue's SGECluster.
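
The cluster I plan to run this on would be set up roughly like this (the queue name, resource sizes, and job count are placeholders):

from dask.distributed import Client
from dask_jobqueue import SGECluster

cluster = SGECluster(
    queue="all.q",          # placeholder SGE queue name
    cores=4,                # cores per job
    memory="8GB",           # memory per job
    walltime="01:00:00",
)
cluster.scale(jobs=10)      # placeholder number of SGE jobs
client = Client(cluster)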

cjlovering

1 Answer

import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
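
For completeness, a self-contained sketch of this pattern on a dask DataFrame; the Foo stub, the CSV path, and the meta are placeholders, and axis=1 with meta is what dask's DataFrame.apply needs for row-wise application:

import dask
import dask.dataframe as dd

class Foo:                   # stand-in for the expensive-to-build object
    def do(self, row):
        return len(row)      # placeholder computation

df = dd.read_csv("data-*.csv")           # placeholder path; df is a dask DataFrame
foo = dask.delayed(Foo)()                # Foo() executes once, on a worker

def do(row, foo):
    return foo.do(row)

result = df.apply(do, axis=1, foo=foo, meta=("result", "int64"))
print(result.compute())

Because foo is a single delayed task, every partition's task depends on it: the object is built once on one worker and then moved to the others, rather than being rebuilt locally and shipped through the graph as a closure.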
MRocklin