I am deserializing large numpy arrays (500MB in this example) and I find the results vary by orders of magnitude between approaches. Below are the 3 approaches I've timed.

I'm receiving the data via the multiprocessing.shared_memory module, so the data reaches me as a memoryview object. In these simple examples, though, I just pre-create the serialized bytes to run the test.

I wonder if there are any mistakes in these approaches, or if there are other techniques I didn't try. Deserialization in Python is a real pickle of a problem if you want to move data fast and not lock the GIL just for the IO. A good explanation as to why these approaches vary so much would also be a good answer.
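For context on my actual use case: since the shared-memory block already holds the raw bytes, the receiving side can wrap it with np.ndarray and skip deserialization entirely, provided shape and dtype are communicated separately. A minimal sketch (16-byte block for brevity, both sides in one process):

```python
from multiprocessing import shared_memory
import numpy as np

# Writer side: create a shared block and fill it through an array view.
shm = shared_memory.SharedMemory(create=True, size=16)
src = np.ndarray(shape=16, dtype=np.uint8, buffer=shm.buf)
src[:] = np.arange(16, dtype=np.uint8)

# Reader side: wrapping the same memoryview is zero-copy, so the
# "deserialization" cost is just constructing the ndarray header.
view = np.ndarray(shape=16, dtype=np.uint8, buffer=shm.buf)
total = int(view.sum())  # sum of 0..15

# Views must be released before the block can be closed.
del src, view
shm.close()
shm.unlink()
```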

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

Results:

Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec

The second option is the fastest, but notably less elegant because I need to explicitly serialize the shape and dtype information.
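One way I can see to keep the raw-bytes speed while staying self-describing is to prepend a small metadata header myself. A rough sketch (the pack/unpack helpers are hypothetical, not part of numpy):

```python
import pickle
import numpy as np

def pack(arr):
    # Pickle only the tiny (shape, dtype) tuple; the bulk payload
    # stays a plain byte copy of the array's data buffer.
    header = pickle.dumps((arr.shape, arr.dtype.str))
    return len(header).to_bytes(8, 'little') + header + arr.tobytes()

def unpack(buf):
    n = int.from_bytes(buf[:8], 'little')
    shape, dtype = pickle.loads(buf[8:8 + n])
    # frombuffer is zero-copy over the remaining bytes (read-only view).
    return np.frombuffer(buf, dtype=dtype, offset=8 + n).reshape(shape)

a = np.arange(12, dtype=np.uint8).reshape(3, 4)
b = unpack(pack(a))
```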

David Parks
  • 25,796
  • 41
  • 148
  • 265
  • 3
    The pun in that quote D: – Mateen Ulhaq Jun 12 '20 at 21:25
  • `pickle` of a numpy array uses the `np.save` format - a small buffer with shape, etc. info followed by a byte copy of the array's data buffer. I would expect a pickle load from file to take about the same time as an `np.load`. I haven't experimented with the bytestring equivalent(s). – hpaulj Jun 12 '20 at 21:28
  • 2
    The `BytesIO` load is slow because `numpy.load` has a [slow path](https://github.com/numpy/numpy/blob/6ee49178517088966e63c2aedf6a8a5779ad5384/numpy/lib/format.py#L739) if the file-like object isn't a real file. It looks like the slow path is a workaround for a GZIP file bug. I wonder if it's still necessary. – user2357112 supports Monica Jun 12 '20 at 22:05
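If the slow path described in the comment above is the culprit, one workaround is to parse the .npy header yourself and wrap the remaining bytes, bypassing np.load's file-object handling. A sketch, assuming the helpers in `numpy.lib.format` (`read_magic`, `read_array_header_1_0`) and a format-1.0 file as written by a plain `np.save`:

```python
import io
import numpy as np
from numpy.lib import format as npf

sample = np.arange(12, dtype=np.uint8)
bio = io.BytesIO()
np.save(bio, sample, allow_pickle=False)
bio.seek(0)

npf.read_magic(bio)  # consume the magic string and version bytes
shape, fortran_order, dtype = npf.read_array_header_1_0(bio)

# Zero-copy wrap of the payload that follows the header.
data = np.frombuffer(bio.getbuffer(), dtype=dtype, offset=bio.tell())
restored = data.reshape(shape, order='F' if fortran_order else 'C')
```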

1 Answer


I found your question useful. I'm also looking for the best numpy serialization approach, and I confirmed that np.load() was best, except that it was beaten by pyarrow in my add-on test below. Arrow is now a very popular data serialization framework for distributed compute (e.g. Spark).

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()

serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))

Results from an i3.2xlarge on Databricks Runtime 8.3 ML (Python 3.8, NumPy 1.19.2, PyArrow 1.0.1):

Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec

Your BytesIO result was about 100x slower than mine; I don't know why.
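Worth noting: `pa.serialize`/`pa.deserialize` were deprecated after the pyarrow 1.x releases. A standard-library route to a similar copy-minimizing transfer is pickle protocol 5 with out-of-band buffers; a minimal sketch:

```python
import pickle
import numpy as np

sample = np.arange(16, dtype=np.uint8)

# buffer_callback diverts the large data buffer out of the pickle stream,
# so `payload` holds only small metadata.
buffers = []
payload = pickle.dumps(sample, protocol=5, buffer_callback=buffers.append)

# loads() re-attaches the buffers, avoiding a copy of the array body
# through the pickle bytes.
restored = pickle.loads(payload, buffers=buffers)
```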

Douglas Moore
  • Note that `time.time()` is not recommended for performance measurement; at least consider `%timeit`. – anon01 May 23 '21 at 22:54