27

I am trying to store a variable-length list of strings in an HDF5 dataset. The code for this is:

import h5py
h5File=h5py.File('xxx.h5','w')
strList=['asas','asas','asas']  
h5File.create_dataset('xxx',(len(strList),1),'S10',strList)
h5File.flush() 
h5File.close()

I am getting the error "TypeError: No conversion path for dtype: dtype('<U3')".
How can I solve this problem?

gman
  • For starters, you have a typo on `create_dataset`. Can you give the exact code you're using, especially where `strList` is coming from? – SlightlyCuban Apr 22 '14 at 14:11
  • Sorry about the typo. I am trying to serialize a pandas data frame to an HDF5 file, so I need a header that contains the names of all the columns. I extracted the column names into a list and am trying to write it to an HDF5 dataset. – gman Apr 22 '14 at 14:22
  • Except for the typo, the code above reproduces exactly the same situation. – gman Apr 22 '14 at 14:25
  • You should probably edit your question and fix the typo. – SlightlyCuban Apr 22 '14 at 15:14

3 Answers

31

You're reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You'll need to encode the strings in a format h5py handles:

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList)

Note: not everything encoded in UTF-8 can be encoded in ASCII!
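As a rough illustration of that caveat, the sketch below compares `encode("ascii", "ignore")` with `encode("unicode_escape")` on a few made-up column names (the names are hypothetical, not from the question). It uses only the standard library and does not touch h5py.

```python
# Hypothetical column names; two of them contain non-ASCII characters.
names = ["temperature", "café", "naïve_score"]

# errors="ignore" silently drops every character that has no ASCII form.
lossy = [n.encode("ascii", "ignore") for n in names]

# unicode_escape keeps those characters as ASCII-only escape sequences,
# so the original text can be recovered with the matching decode.
escaped = [n.encode("unicode_escape") for n in names]
restored = [b.decode("unicode_escape") for b in escaped]

print(lossy)     # non-ASCII characters are gone
print(escaped)   # ASCII bytes with backslash escapes
print(restored)  # same strings as the input
```

Whichever encoding is used, the resulting bytes still have to fit the fixed width chosen for the dataset ('S10' in the question).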

SlightlyCuban
  • What's the proper way to re-extract these strings from the hdf5 file (in python3)? – DilithiumMatrix Jun 22 '16 at 03:46
  • @DilithiumMatrix ASCII is also valid UTF-8, but you can use `ascii.decode('utf-8')` if you need `str` type. Note: my answer will drop non-ASCII characters. If you preserved them with `encode('unicode_escape')`, then you need `decode('unicode_escape')` to convert things back. – SlightlyCuban Jun 22 '16 at 16:53
  • @DilithiumMatrix To re-extract the strings: `stringlist=np.array(f['xxx']); vals=[str(el).strip('[]').strip('\'') for el in stringlist.astype(str)]` – user3018476 Aug 30 '20 at 02:56
14

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str)

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)

In [29]: dset[0] = 'the change of water into water vapour'

In [30]: dset[0]
Out[30]: 'the change of water into water vapour'
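
Applied to the list from the question, a minimal round-trip sketch might look like the following. It assumes h5py is installed; the temporary file path and the `read_back` helper list are illustrative. Depending on the h5py version, variable-length strings may come back as `bytes` rather than `str`, so the sketch decodes them if needed.

```python
import os
import tempfile

import h5py

str_list = ["asas", "asas", "asas"]

# Illustrative location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "vlen_demo.h5")

# Variable-length string dtype, as in the session above.
dt = h5py.special_dtype(vlen=str)

with h5py.File(path, "w") as f:
    # Shape is inferred from the data; no fixed width such as 'S10' is needed.
    f.create_dataset("xxx", data=str_list, dtype=dt)

with h5py.File(path, "r") as f:
    raw = f["xxx"][:]

# Some h5py versions return bytes for VL strings, others return str.
read_back = [v.decode("utf-8") if isinstance(v, bytes) else v for v in raw]
print(read_back)
```
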
yardstick17
5

I am in a similar situation: I want to store the column names of a dataframe as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works:

import h5py
h5File = h5py.File('my_file.h5', 'w')
h5File['col_names'] = df.columns.values.astype('S')
h5File.close()

This assumes the column names are 'simple' strings that can be encoded in ASCII.
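
To see what `astype('S')` does on its own, here is a small NumPy-only sketch. The object array is a stand-in for `df.columns.values`, and the column names are made up for illustration.

```python
import numpy as np

# Stand-in for df.columns.values: an object array of Python str.
col_names = np.array(["open", "close", "volume"], dtype=object)

# astype('S') produces fixed-length byte strings sized to the longest name.
encoded = col_names.astype("S")

# After reading the dataset back, the bytes can be decoded to str again.
decoded = [b.decode("ascii") for b in encoded]

print(encoded.dtype)
print(decoded)
```
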