46

I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).

I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).

Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
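To make the kind of workflow I am after concrete, here is a minimal sketch of disk-backed processing with h5py (the file name "data.h5" and dataset name "x" are made up for illustration). One-pass statistics like a mean stream through easily; the hard part is exactly the cases I mentioned, like quantiles, that need the full sample:

```python
import h5py

# Hypothetical file with one large 1-D dataset "x" that will not fit in RAM.
with h5py.File("data.h5", "r") as f:
    dset = f["x"]            # just a handle; nothing is read yet
    n = dset.shape[0]
    chunk = 10 ** 6          # rows per slice, tuned to available memory

    # A one-pass statistic (the mean) computed out-of-core:
    total = 0.0
    for start in range(0, n, chunk):
        block = dset[start:start + chunk]   # only this slice hits memory
        total += block.sum()
    print(total / n)
```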

EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB, etc.) needs to hold data in memory. SAS continues to have a strong selling point here: disk-based analytics may be slower, but you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.

JoshAdel
Josh Hemann
  • This is not really a programming question, more of an online dating question. Clearly there are lots of people using H5 and Python because the h5py team has been developing for a number of years. P.S. Python usage in the sciences is growing by leaps and bounds. – Michael Dillon Feb 02 '11 at 09:44
  • Is the amount of time a library has been in development really an indicator of its use in the setting I am asking about? To be clear, I am already a Python fan and use it for my job in business analytics as well as air pollution modeling at a major university. I am asking about a specific use case: using a language that does in-memory processing to work on huge data sets, not amenable to map-reduce, and traditionally tackled by SAS for decades. – Josh Hemann Feb 02 '11 at 13:03
  • Not a joke, but have you considered just using hardware with enough memory? – eat Feb 02 '11 at 13:39
  • Perhaps the hardware requirements won't be too dramatic if this concept https://github.com/FrancescAlted/carray starts flying! – eat Feb 03 '11 at 19:24
  • @eat: Interesting link. I have read through various presentations by Mr. Alted and the tools he develops are amazing. Alas, I am looking to keep my standard, numpy-based code intact as much as possible, so I have avoided things like PyTables. It is not clear to me how to use his compression tools in everyday work. Would I load some data, operate on it, compress it to make room to load more data, and so on, compressing/decompressing as needed? This could help in some settings. – Josh Hemann Feb 04 '11 at 05:07

2 Answers

49

We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets are up to a few hundred GB in size.

HDF5 advantages:

  • data can be inspected conveniently using the h5view application, h5py/ipython and the h5* command line tools
  • APIs are available for different platforms and languages
  • data can be structured using groups
  • data can be annotated using attributes
  • built-in data compression is worry-free (groups, attributes and compression are sketched after this list)
  • I/O on single datasets is fast
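A minimal sketch of the groups/attributes/compression points (file and dataset names are made up; gzip is one of HDF5's built-in filters):

```python
import numpy as np
import h5py

with h5py.File("results.h5", "w") as f:
    # groups give the file a directory-like structure
    grp = f.create_group("experiment_001")

    # transparent, built-in compression on a dataset
    dset = grp.create_dataset("measurements",
                              data=np.random.rand(10000, 100),
                              compression="gzip", compression_opts=4)

    # attributes annotate data right where it lives
    dset.attrs["units"] = "volts"
    dset.attrs["sample_rate_hz"] = 48000
```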

HDF5 pitfalls:

  • Performance breaks down if an h5 file contains too many datasets/groups (> 1,000), because traversing them is very slow. On the other hand, I/O is fast for a few big datasets.
  • Advanced, SQL-like data queries are clumsy to implement and slow (consider SQLite in that case)
  • HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
  • changing h5 datasets (resize, delete, etc.) blows up the file size in the best case or is impossible in the worst case; the whole h5 file has to be copied to flatten it again (see the sketch below)
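To make that last point concrete, a small sketch (file name made up):

```python
import h5py

with h5py.File("scratch.h5", "w") as f:
    # Resizing only works for datasets created with maxshape;
    # without it the shape is fixed forever.
    dset = f.create_dataset("growable", shape=(0,), maxshape=(None,),
                            dtype="f8", chunks=True)
    dset.resize((10 ** 6,))

    # Deleting only removes the link; the space is NOT reclaimed.
    # The file stays big until it is rewritten, e.g. with the h5repack
    # command line tool:  h5repack scratch.h5 packed.h5
    del f["growable"]
```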
Bernhard Kausler
  • This is a really helpful answer. I was not aware of h5view. Luckily, I do not see the need to have deeply hierarchical files. But the thread-safety pitfall is an important one, because I try to use the multiprocessing package or the parallel extensions in IPython to speed up calculations as much as possible. – Josh Hemann Feb 02 '11 at 16:12
  • Can you provide a reference for the first pitfall? It is not listed in the HDF5 FAQ (http://www.hdfgroup.org/HDF5/faq/perfissues.html), for example. – Brecht Machiels Aug 01 '14 at 07:07
  • The bad performance I described is based on my personal experience. Maybe "breakdown" is the wrong word: traversing thousands of datasets/groups is just much slower than traversing thousands of slices in a single dataset. – Bernhard Kausler Aug 01 '14 at 12:32
5

This is a long comment, not an answer to your actual question about h5py. I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.

Three reasons:

  • you can mine the source code of any of those packages for ideas that might help you generally
  • you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
  • under some circumstances, it might prove convenient to just link to R for a particular analysis using one of the above-linked packages and then draw the results back into Python (a sketch of that round trip follows this list)
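For the third point, the round trip could look something like this with rpy2 (assuming rpy2 and R are installed; the data here is made up):

```python
from rpy2 import robjects

# push Python data into R, run an R function, pull the result back
robjects.globalenv["x"] = robjects.FloatVector([1.5, 2.0, 3.7, 0.4, 2.2])
quantiles = robjects.r("quantile(x, probs = c(0.25, 0.5, 0.75))")

result = list(quantiles)  # back to plain Python floats
```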

Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.

Matt Parker
  • This is good advice. We use R too, but more so Python. Same issue though with respect to in-memory analytics. From the link you sent, the ff package seems like the R analog to what I am talking about with h5py. And of course commercially there is the XDF format supported by Revolution Analytics. But from what I understand it is currently a pretty limited set of functionality focused on regression. – Josh Hemann Feb 03 '11 at 04:04