17

If you were writing a bioinformatics algorithm in Haskell, you'd probably use an algebraic data type to represent the nucleotides:

data Nucleotide = A | T | C | G

You'd do similarly in Standard ML or OCaml, I assume (I've never really used either).

A value of type Nucleotide can clearly be contained in two bits. However, doing so would cause access times to be slower than if you used one byte per Nucleotide value, as you'd need to select out the two bits of interest using binary operators.

There is therefore an inherent tradeoff that the compiler must make between memory efficiency and computational efficiency when deciding how to represent algebraic data types. Furthermore, the representation of algebraic data types in memory is made more complicated by the fact that the value can be of variable size:

data Maybe a = Just a | Nothing

Clearly, a Maybe a value of the form Just a is logically larger than a value of the form Nothing. In an extreme example like this:

data Hulk a b c d e = Big a b c d e | Little

you definitely wouldn't want to have to store in a Little value null pointers or zero values for the five values contained in Big values. I'm assuming that you'd just use heap-allocated memory of variable size, with a constructor ID at the beginning (for example, 0 for Big and 1 for Little). However, if you wanted to store Hulk values on the stack (a faster representation), you'd have to store the blank memory along with Little values so that all values of the type Hulk are the same size. Another tradeoff.

Simon Marlow answered my general question in regard to GHC in a previous StackOverflow question. However, I have three related questions that remain unanswered:

  • Do Standard ML (SML/NJ and MLton) and OCaml use the same technique?
  • If so, do any less common compilers of these languages (or their siblings) experiment with other techniques?
  • Is there a reasonably easy way (ideally a pragma or option flag) in these languages to use a more memory efficient representation, like the two-bit representation of Nucleotide? Such memory efficiency is necessary for many bioinformatics applications; if each Nucleotide had to be one byte, high-performance bioinformatics algorithms would have to resort to manual bit-fiddling.
Mike
  • 1,509
  • 2
  • 14
  • 20
  • For haskell, you can check with options to GHC like [`-ddump-asm`](https://gist.github.com/bheklilr/2fdb1b4b640c9fa02e19) or `-ddump-simpl` to view how it's stored at a lower level. Basically, for your simple example, each tag appears to be represented as a `long`, but there's some meta data in there that I'm not really sure what it's doing either. The basic gist is that each constructor gets turned into a closure, then those get combined to form the data type's closure. – bheklilr Jul 18 '14 at 02:23
  • 8
    You certainly will not get a more definitive answer (or answerer) about GHC than Simon Marlow's in the linked question. Since this is the de-facto standard Haskell implementation, perhaps you should make your question specific to another language -- or perhaps we can close this one as a duplicate of that one. What do you think? – Daniel Wagner Jul 18 '14 at 03:08
  • @DanielWagner: I guess my current question isn't fully answered by that, as I asked about SML and OCaml as well. I'll reword it to ask about general techniques, and memory-efficient implementations. – Mike Jul 18 '14 at 04:05
  • 2
    Data representation is not likely to be something you can easily make blanket statements about, as the specifications of high-level functional languages leave a lot of room for implementations to vary. Someone could speak definitively about OCaml, but SML has a variety of implementations. However, my guess is that *none* of these languages will automatically give you the compact representation you want with just a pragma or flag, as they tend towards heap allocation and uniform representation due to parametric polymorphism. – Levi Pearson Jul 18 '14 at 15:36
  • @LeviPearson: Good point. I specified which SML compilers I was thinking of: SML/NJ and MLton. I'm not aware of any others that are commonly used. – Mike Jul 18 '14 at 21:57
  • Isn't the two bit representation of Nucleotide actually faster for many memory heavy algorithms because the CPU causes fewer cache misses? – Jeff Burdges Feb 17 '15 at 16:35

1 Answers1

2

There's no single answer: data types are abstract structures and can be implemented in a variety of ways at the implementor's discretion. In practice considerations such as separate compilation tend to constrain things somewhat.

For the specific case of packing a data type containing only nullary constructors into as few bits as possible, you could proceed by defining functions from data type to small integer and back. An integral type hidden by an abstract type (or in Haskell, newtype) would also be a reasonable choice. Packing and unpacking the small integers into whatever aggregate form you are working with would be your job.

By the way, Real World OCaml has a very nice chapter on the representation of OCaml values (rough summary: not hugely different from GHC for the purposes of this question).

gsg
  • 8,645
  • 1
  • 18
  • 23
  • OCaml values are not hugely different when you stay in the common subset, but so-called “polymorphic variants”, that are not in the common subset, are noteworthy in themselves. – Pascal Cuoq Jul 19 '14 at 09:04
  • Indeed, there's also objects. I don't think these constructs have much bearing on the OPs question, so I'll reword to suggest that there is more to the subject. – gsg Jul 19 '14 at 09:56