
What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?

# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))

# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
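
For concreteness, here is roughly what I imagine the scatter version looking like, reusing Foo and df from above (process_row is a hypothetical helper, and mapping over individual rows is just to show the mechanics, not an efficient layout):

from dask.distributed import Client

client = Client()                                  # or Client(cluster) once the SGECluster exists
foo = Foo()                                        # still created locally in this variant
foo_future = client.scatter(foo, broadcast=True)   # ship one copy to every worker

def process_row(row, foo):                         # hypothetical per-row helper
    return foo.do(row)

futures = client.map(process_row, list(df.itertuples()), foo=foo_future)
results = client.gather(futures)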

I plan to use this with dask_jobqueue's SGECluster.
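
The cluster I plan to run this on would be set up roughly like this (the queue name, resource sizes, and job count are placeholders):

from dask.distributed import Client
from dask_jobqueue import SGECluster

cluster = SGECluster(
    queue="all.q",          # placeholder SGE queue name
    cores=4,                # cores per job
    memory="8GB",           # memory per job
    walltime="01:00:00",
)
cluster.scale(jobs=10)      # placeholder number of SGE jobs
client = Client(cluster)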

cjlovering

1 Answer

import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
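
For completeness, a self-contained sketch of this pattern on a dask DataFrame; the Foo stub, the CSV path, and the meta are placeholders, and axis=1 with meta is what dask's DataFrame.apply needs for row-wise application:

import dask
import dask.dataframe as dd

class Foo:                   # stand-in for the expensive-to-build object
    def do(self, row):
        return len(row)      # placeholder computation

df = dd.read_csv("data-*.csv")           # placeholder path; df is a dask DataFrame
foo = dask.delayed(Foo)()                # Foo() executes once, on a worker

def do(row, foo):
    return foo.do(row)

result = df.apply(do, axis=1, foo=foo, meta=("result", "int64"))
print(result.compute())

Because foo is a single delayed task, every partition's task depends on it: the object is built once on one worker and then moved to the others, rather than being rebuilt locally and shipped through the graph as a closure.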
MRocklin