
Is there a way to tell Pandas to use a specific pickle protocol (e.g. 4) when writing an HDF5 file?

Here is the situation (much simplified):

  • Client A is using python=3.8.1 (as well as pandas=1.0.0 and pytables=3.6.1). A writes some DataFrame using df.to_hdf(file, key).

  • Client B is using python=3.7.1 (and, as it happens, pandas=0.25.1 and pytables=3.5.2, but that's irrelevant). B tries to read the data written by A using pd.read_hdf(file, key), and fails with ValueError: unsupported pickle protocol: 5.

Mind you, this doesn't happen with a purely numerical DataFrame (e.g. pd.DataFrame(np.random.normal(size=(10,10)))). So here is a reproducible example:

(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world'])
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>

Note: I also tried reading with pandas=1.0.0 (and pytables=3.6.1) in python=3.7.4. That fails too, so I believe it is simply the Python version (3.8 writer vs 3.7 reader) that causes the problem. This makes sense, since pickle protocol 5 was introduced by PEP 574 in Python 3.8.
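To illustrate the mechanism behind the error, here is a minimal sketch using only the standard library. It hand-builds a pickle stream whose header claims a protocol one higher than the running interpreter supports, which should trigger the same kind of ValueError that the 3.7 reader raises on a protocol 5 stream. The stream itself is made up for illustration and is not taken from an HDF5 file.

```python
import pickle

# A minimal pickle stream: PROTO opcode (0x80), protocol byte, NONE, STOP.
# The protocol byte is deliberately one higher than this interpreter supports.
too_new = pickle.HIGHEST_PROTOCOL + 1
stream = b'\x80' + bytes([too_new]) + b'N.'

message = None
try:
    pickle.loads(stream)
except ValueError as exc:
    # The loader rejects the stream as soon as it reads the protocol byte.
    message = str(exc)

print(message)
```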

Pierre D

3 Answers


Update: I was wrong to assume this was not possible. In fact, based on the excellent "monkey-patch" suggestion of @PiotrJurkiewicz, here is a simple context manager that lets us temporarily change the highest pickle protocol. It:

  1. Hides the monkey-patching, and
  2. Has no side-effect outside of the context; it can be used at any time, whether pickle was previously imported or not, before or after pandas, no matter.

Here is the code (e.g. in a file pickle_prot.py):

import importlib
import pickle


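class PickleProtocol:
    """Temporarily set pickle.HIGHEST_PROTOCOL to `level`."""

    def __init__(self, level):
        # Remember the current value so that it can be restored on exit.
        self.previous = pickle.HIGHEST_PROTOCOL
        self.level = level

    def __enter__(self):
        # Reload to start from a pristine pickle module, then apply the level.
        importlib.reload(pickle)
        pickle.HIGHEST_PROTOCOL = self.level

    def __exit__(self, *exc):
        # Runs even if the body raised; put the previous value back.
        importlib.reload(pickle)
        pickle.HIGHEST_PROTOCOL = self.previous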


def pickle_protocol(level):
    return PickleProtocol(level)

Usage example in a writer:

import pandas as pd
from pickle_prot import pickle_protocol


pd.DataFrame(['hello', 'world']).to_hdf('foo_0.h5', 'x')

with pickle_protocol(4):
    pd.DataFrame(['hello', 'world']).to_hdf('foo_1.h5', 'x')

pd.DataFrame(['hello', 'world']).to_hdf('foo_2.h5', 'x')

And, using a simple test reader:

import pandas as pd
from glob import glob

for filename in sorted(glob('foo_*.h5')):
    try:
        df = pd.read_hdf(filename, 'x')
        print(f'could read {filename}')
    except Exception as e:
        print(f'failed on {filename}: {e}')

Now, trying to read in py37 after having written in py38, we get:

failed on foo_0.h5: unsupported pickle protocol: 5
could read foo_1.h5
failed on foo_2.h5: unsupported pickle protocol: 5

But, using the same Python version (3.7 or 3.8) to both read and write, we of course get no exception.
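For comparison, the same idea can be sketched with contextlib and a try/finally, without reloading the module. This is only a sketch: the name highest_pickle_protocol is hypothetical, and it assumes the writing code looks up pickle.HIGHEST_PROTOCOL on the module at call time (as the PyTables line linked in the other answer appears to do). It also changes a module-level global, so it is not safe if other threads are pickling at the same time.

```python
import pickle
from contextlib import contextmanager


@contextmanager
def highest_pickle_protocol(level):
    """Temporarily set pickle.HIGHEST_PROTOCOL (hypothetical helper name)."""
    previous = pickle.HIGHEST_PROTOCOL
    pickle.HIGHEST_PROTOCOL = level
    try:
        yield
    finally:
        # Restore the old value even if the body raised an exception.
        pickle.HIGHEST_PROTOCOL = previous


with highest_pickle_protocol(4):
    # Any code that passes pickle.HIGHEST_PROTOCOL explicitly now gets 4.
    payload = pickle.dumps(['hello', 'world'], pickle.HIGHEST_PROTOCOL)

# For protocol 2 and later, the second byte of the stream is the protocol number.
print(payload[1])
```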

Note: issue 33087 is still on the pandas issue tracker.

Pierre D
  • on a related note: [this SO answer](https://stackoverflow.com/a/65152562/758174) shows how to _find out what pickle protocol_ (if any) was used by a Pandas-written HDF5 file. – Pierre D Dec 05 '20 at 12:39

PyTables uses the highest protocol by default, which is hardcoded here: https://github.com/PyTables/PyTables/blob/50dc721ab50b56e494a5657e9c8da71776e9f358/tables/atom.py#L1216

As a workaround, you can monkey-patch the pickle module on client A, which writes the HDF file. You should do that before importing pandas:

import pickle
pickle.HIGHEST_PROTOCOL = 4
import pandas

df.to_hdf(file, key)

Now the HDF file has been created using pickle protocol version 4 instead of version 5.
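Here is a small sketch of why the patch takes effect, using only the standard library. The function dump_like_pytables is a hypothetical stand-in for the PyTables line linked above, not the library's actual code; the point is that the protocol is read from the pickle module each time the function is called.

```python
import pickle


def dump_like_pytables(obj):
    # Hypothetical stand-in: the protocol is looked up on the pickle module
    # when the function runs, so a patched value is picked up.
    return pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)


original = pickle.HIGHEST_PROTOCOL
pickle.HIGHEST_PROTOCOL = 4
try:
    patched = dump_like_pytables(['hello', 'world'])
finally:
    # Undo the patch so the rest of the program is unaffected.
    pickle.HIGHEST_PROTOCOL = original

# For protocol 2 and later, the second byte of the stream is the protocol number.
print(patched[1])  # prints 4
```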

Piotr Jurkiewicz
  • This is brilliant! Thank you very much. In fact, I figured a way to hide the "monkey-ness" of it. I'm amending my original (and wrong) answer. – Pierre D Apr 03 '20 at 01:13

I am (was) facing the same problem. I "know" how to solve it, and I think you do too: re-export the whole dataset to pickle (or CSV), then convert it back to HDF5 under Python 3.7 (which only knows pickle protocols up to 4).

The flow is something like this: python3.8 -> hdf5 -> python3.8 -> csv/pickle -> python3.7 -> hdf5 (compatible with both versions)
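One detail of that flow: if the intermediate file is a pickle, it presumably has to be written with protocol 4 or lower, otherwise the Python 3.7 side hits the same error. A minimal sketch of that step with the standard library, using a plain dict as a stand-in for the DataFrame and a made-up file name (pandas' DataFrame.to_pickle also accepts a protocol argument):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the data read back from the HDF5 file under Python 3.8.
data = {'x': ['hello', 'world']}

# Hypothetical intermediate file in a temporary directory.
path = Path(tempfile.mkdtemp()) / 'intermediate.pkl'

# Write with an explicit protocol that Python 3.7 can read.
with open(path, 'wb') as f:
    pickle.dump(data, f, protocol=4)

# Read it back; the second header byte records the protocol used.
with open(path, 'rb') as f:
    header = f.read(2)
    f.seek(0)
    restored = pickle.load(f)

print(header[1], restored == data)
```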

I avoided this route because my DataFrame is exported in chunks, which creates a large number of files.

Are you actually limited to Python 3.7? I was limited by TensorFlow, which as of now officially supports only up to 3.7, but you can install the TensorFlow nightly build and it works with Python 3.8.

Check whether you can make the move to 3.8; that would surely solve your problem. :)

André Carvalho
  • not an option with many clients working on heterogeneous platforms and various Python versions, using a common distributed filesystem and a large number of datasets. Further, I would like to avoid duplicating the datasets (one with protocol 5, the other 4)... Right now it is a blocking issue for us to allow clients to start using py38. – Pierre D Mar 27 '20 at 23:31