11

I have a huge HDF5 file, I want to load part of it in a pandas DataFrame to perform some operations, but I am interested in filtering some rows.

I can explain better with an example:

Original HDF5 file would look something like:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is to load this, exactly as it is, to a pandas Dataframe but only where A==1 or 3 or 4

Until now I can just load the whole HDF5 using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

codeKiller
  • 4,342
  • 10
  • 41
  • 90

2 Answers2

11

The hdf5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value column A is 1, 3 or 4. For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
unutbu
  • 711,858
  • 148
  • 1,594
  • 1,547
  • Thanks unutbu, just some comments on this good answer. I understand that, at the begining of your answer you write the df to h5 in table format. However, the input to my script is a h5 already saved, how can I know if it is in the correct format or not? – codeKiller Oct 31 '15 at 14:06
  • If your `h5` file is not in `table` format, then using `pd.read_hdf` with the `where` parameter will raise `TypeError: cannot pass a where specification when reading from a Fixed format...`. If the `h5` file is in `table` format by `A` was not specified as a `data_column`, then you would get `ValueError: The passed where expression: A in [1,3,4] contains an invalid variable reference...`. – unutbu Oct 31 '15 at 14:09
  • I don't know of a quick/easy way to convert an h5 file from `fixed` to `table` format, or to add `data_columns`. As far as I know, you would have to read the entire `h5` file into a DataFrame (or to do so in chunks using the `chunksize` parameter) and then to write it out or append to a different `h5` file in `table` format. – unutbu Oct 31 '15 at 14:14
1

You can do this using pandas.read_hdf (here), with the optional parameter of where.
For example: read_hdf('store_tl.h5', 'table', where = ['index>2'])

Dean Fenster
  • 2,140
  • 1
  • 17
  • 25
  • Thanks Dean, is it possible to include more complex conditions? like for example, if my A column has values from 1 to 100 and I want to select something random like [1,3,11,16,27,33,34,44,41,55,68,70,77,81,90]...I am asking that because in the main question I wanted to make it easy, but in my real case the `where` condition would need to be more complex – codeKiller Oct 31 '15 at 13:28
  • Found the documentation for the Expr you need to pass to where: http://nullege.com/codes/search/pandas.computation.pytables.Expr?fulldoc=1 (This might be helpful, though I don't see an option for filtering by containment) – Dean Fenster Oct 31 '15 at 13:42