
Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec against a simple hand-rolled precedence climbing parser written in F#, and was surprised to find that the F# was 25× faster than the Haskell.

Here's the Haskell code I used to read a large mathematical expression from a file, parse it, and evaluate it:

import Control.Applicative
import Text.Parsec hiding ((<|>))

-- The usual expression grammar: expr handles + and -, term handles * and /,
-- and fact is an integer literal or a parenthesised expression.
expr, term, fact :: Parsec String () Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')

term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')

fact = read <$> many1 digit <|> char '(' *> expr <* char ')'

eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')

main :: IO ()
main = do
    file <- readFile "expr"
    putStr $ show $ eval file
    putStr "\n"

and here's my self-contained precedence climbing parser in F#:

// Total active pattern: parse a complete expression from a char list.
let rec (|Expr|) = function
  | P(f, xs) -> Expr(loop (' ', f, xs))
  | xs -> invalidArg "Expr" (sprintf "%A" xs)
// The climber itself: 'oop' is the operator to the left of the value 'f'.
and loop = function
  | (' ' as oop), f, (('+' | '-') as op)::P(g, xs)
  | ((' ' | '+' | '-') as oop), f, (('*' | '/') as op)::P(g, xs) ->
      // Parse the right-hand operand at higher precedence, then fold it in.
      let h, xs = loop (op, g, xs)
      let op = match op with
               | '+' -> (+) | '-' -> (-) | '*' -> (*) | _ -> (/)
      loop (oop, op f h, xs)
  | _, f, xs -> f, xs
// Partial active pattern: a primary expression, i.e. a parenthesised
// expression or a run of digits forming an integer literal.
and (|P|_|) = function
  | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
  | c::_ as xs when '0' <= c && c <= '9' ->
      let rec loop n = function
        | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
        | xs -> Some(P(n, xs))
      loop 0 xs
  | _ -> None

My impression is that even state-of-the-art parser combinators waste a lot of time backtracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance, or is it necessary to use code generation?

EDIT:

Here's the OCaml script I used to generate a ~2MB expression for benchmarking (pass the nesting depth as the first command-line argument and redirect the output to a file named expr):

open Printf

(* Emit a nested expression of depth n; the size grows as ~4^n. *)
let rec f ff n =
  if n=0 then fprintf ff "1" else
    fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)

let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 3 in
  fprintf stdout "%a\n" f n
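With argument 1, for instance, it prints 1+1*(1-1); the output roughly quadruples in size with each increment, so argument 10 gives an expression of about 2.7MB.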
J D
  • I've always been under the impression that parser combinators were inefficient, but you'd really have to try both solutions in the same language to get a good gauge of the speed difference. – Gabe Dec 30 '10 at 04:10
  • it seems you're using Parsec 3.x, which according to this is slower than Parsec 2: http://www.mail-archive.com/haskell-cafe@haskell.org/msg81686.html – Mauricio Scheffer Dec 30 '10 at 04:35
  • Possibly another issue is that you're using char lists in Haskell. Parsing a ByteString is faster. – fuz Dec 30 '10 at 04:43
  • What's that `~` in the F# code? To be fair, your two parsers must parse exactly the same. – fuz Dec 30 '10 at 05:28
  • can you post your test data (i.e. the expressions)? – Mauricio Scheffer Dec 30 '10 at 06:38
  • @FUZxxl "Parsing a ByteString is faster". Actually, I'd wager that's neither here nor there. The reason the original Haskell is slower than the OCaml is that the Haskell is using the full power of parser combinators (backtracking, position handling, ...) at the lexical level. The OCaml code is doing the bare minimum of work - essentially 'see a char, do an action'. – stephen tetley Dec 30 '10 at 07:59
  • @FUZxxl: The precedence climber in F# supports raise-to-the-power and unary minus (the `~`). The parser combinators in Haskell do not. – J D Dec 30 '10 at 14:08
  • @stephen tetley: Truly. I would really like to see a comparison with comparable algorithms (e.g. both as combinators or both as hand-written parsers). Also, see djahandarie's solution for evidence that ByteString is faster. – fuz Dec 30 '10 at 14:19
  • FUZxxl: I have removed the extra features from the precedence climber in F#. – J D Dec 30 '10 at 14:37
  • @FUZxxl, wouldn't it make more sense for the combinator and the hand-written parser to be in the same language than to only do one in both languages? I saw the question as an evaluation of parser combinators, not a language shootout. – Darius Jahandarie Dec 30 '10 at 14:41
  • I don't know a lot about F#, but it looks to me like the F# parser doesn't parse multiple-digit numbers - in any case, when I translated it to Haskell my resulting parser didn't. Is that correct, or am I missing something? – mokus Dec 30 '10 at 15:20
  • @mokus: that seems to be the case; I was just running it in F# and it fails with multi-digit numbers – Mauricio Scheffer Dec 30 '10 at 16:10
  • @djahandarie: I'm assuming that the difference between the languages is ≪ the 25× performance difference which, given your results below, seems to be a valid assumption. – J D Dec 30 '10 at 17:26
  • @mokus: I've updated the F# code to handle multi-digit numbers. The difference in performance is insignificant. – J D Dec 30 '10 at 17:28
  • @Mauricio: I've updated the question to include the OCaml (!) code I used to generate a big expression to benchmark different parsers. – J D Dec 30 '10 at 17:31
  • I'd be curious about a comparison with parsec's built-in precedence-aware expression parser as well. If nobody else does this, maybe I'll take a swing at it next week. – sclv Dec 31 '10 at 02:11
  • @sclv: I'd be curious to see a translation of the precedence climber to Haskell. I had a hack but gave up. – J D Dec 31 '10 at 23:26
  • @Jon Harrop - I wasn't expecting a difference in performance, I was just translating the code for my own edification and noticed the difference. – mokus Jan 01 '11 at 13:12

4 Answers


I've come up with a Haskell solution that is 30× faster than the Haskell solution you posted (with my concocted test expression).

Major changes:

  1. Changed Parsec/String to Attoparsec/ByteString
  2. In the fact parser, changed `read <$> many1 digit` to `decimal`
  3. Made the `chainl1` recursion strict (remove the `$!` for the lazier version)

I tried to keep everything else you had as similar as possible.

import Control.Applicative
import Data.Attoparsec
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B

expr :: Parser Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')

term :: Parser Int
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')

fact :: Parser Int
fact = decimal <|> char '(' *> expr <* char ')'

eval :: B.ByteString -> Int
eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')

chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
chainl1 p op = p >>= rest where
  rest x = do f <- op
              y <- p
              rest $! (f x y)
           <|> pure x

main :: IO ()
main = B.readFile "expr" >>= (print . eval)

I guess what I concluded from this is that the majority of the slowdown came from the parser combinator sitting on an inefficient base, not from it being a parser combinator per se.

I imagine with more time and profiling this could go faster, as I stopped when I went past the 25× mark.

I don't know if this would be faster than the precedence climbing parser ported to Haskell. Maybe that would be an interesting test?
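For the curious, here's a rough sketch of what such a port might look like, matching directly on the character list the way the F# does. This is just my illustration, not Jon's code: the names (parseExpr, climb, parseAtom) are invented here and error handling is minimal.

import Data.Char (digitToInt, isDigit)

-- Parse an expression, returning its value and the unconsumed input.
parseExpr :: String -> (Int, String)
parseExpr s = let (lhs, rest) = parseAtom s in climb 1 lhs rest

-- Precedence climbing: keep folding in operators whose precedence is at
-- least minPrec; a left-associative operator of precedence p parses its
-- right-hand operand with minimum precedence p + 1.
climb :: Int -> Int -> String -> (Int, String)
climb minPrec lhs (op:rest)
  | prec op >= minPrec =
      let (rhs, rest')   = parseAtom rest
          (rhs', rest'') = climb (prec op + 1) rhs rest'
      in  climb minPrec (apply op lhs rhs') rest''
climb _ lhs s = (lhs, s)

prec :: Char -> Int
prec c | c == '+' || c == '-' = 1
       | c == '*' || c == '/' = 2
       | otherwise            = 0   -- not an operator: stop climbing

apply :: Char -> Int -> Int -> Int
apply '+' = (+)
apply '-' = (-)
apply '*' = (*)
apply _   = div

-- An atom is a parenthesised expression or a run of digits.
parseAtom :: String -> (Int, String)
parseAtom ('(':s) = case parseExpr s of
  (v, ')':rest) -> (v, rest)
  _             -> error "expected ')'"
parseAtom s = digits 0 s
  where digits n (c:cs) | isDigit c = digits (10 * n + digitToInt c) cs
        digits n cs                 = (n, cs)

main :: IO ()
main = do
  file <- readFile "expr"
  print (fst (parseExpr (filter (/= ' ') file)))

Like the F# original, this does the bare minimum at the lexical level, so I'd expect it to behave more like a hand-rolled parser than a combinator library in a benchmark.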

Darius Jahandarie
  • Parsec's default behavior of enabling backtracking is another reason why it's often slow. Parser combinators with the opposite behavior (e.g. uu-parsinglib) are often faster and simpler to reason about. – John L Dec 30 '10 at 11:03
  • +1 I can confirm that this is much faster and even faster than the F#. I'll convert the F# to use the equivalent of `ByteString` instead of `char` lists. – J D Dec 30 '10 at 14:38
  • Converting my F# to use ASCII-encoded strings (the equivalent of Haskell's `ByteString`, I believe) instead of Unicode makes it several times faster again. Specifically, 1.78s for my hand-rolled parser in F# vs 8.44s for your attoparsec parser in Haskell on a 42MB file. – J D Dec 31 '10 at 01:18
  • ByteString holds strings of bytes. That breaks the Unicode abstraction but lets you parse UTF-8 if you are careful, and if you aren't that careful, you can parse the subset of Unicode that fits in a single byte. Parsec 3 can also parse Text (which is full Unicode, internally represented as UTF-16), though reportedly not as efficiently as, e.g., a specialized implementation. There's an attoparsec-text in development to provide precisely that. – sclv Dec 31 '10 at 02:09
  • Jon, would you like me to try and make a faster Haskell version? I think that the F# hand-rolled parser vs. the F# parser combinator is a better test (of parser combinators) now that it's possible due to Stephan's implementation. Switching the host language when testing two solutions probably isn't a very scientific way to handle it, but I can spend some more time on this if you'd like to see how fast I can get it. – Darius Jahandarie Dec 31 '10 at 03:58
  • djahandarie: I'd very much like to see a translation of my hand-rolled parser to Haskell and an efficient Haskell-based solution using parser combinators that handles Unicode. I had actually assumed that my original Parsec-based solution would handle Unicode correctly. Is that not the case? – J D Jan 01 '11 at 23:44
  • The original Parsec solution working on Haskell Strings handles Unicode correctly. – sclv Jan 02 '11 at 04:37
  • @DariusJahandarie Since you wrote your answer, [`attoparsec` has undergone substantial refactoring, and performance has improved](http://www.serpentine.com/blog/2014/05/31/attoparsec/). I'd be curious to know how your parser (compiled with `-O2`) compares to Jon Harrop's F# parser now; unfortunately, I can't run F# on my machine. – jub0bs Jul 03 '15 at 10:47

I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.

[Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]

I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator-precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (for Jon's parser).

I measured the following numbers (updated numbers for v. 0.9 in parens):

  • Jon's hand-rolled parser: ~230ms
  • FParsec parser #1: ~270ms (~235ms)
  • FParsec parser #2: ~110ms (~102ms)

In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec

  • automatically generates highly readable error messages,
  • supports very large files as input (with arbitrary backtracking), and
  • comes with a declarative, runtime-configurable operator-precedence parser module.

Here's the code for the two FParsec implementations:

Parser #1 (Translation of djahandarie's parser):

open FParsec

let str s = pstring s
let expr, exprRef = createParserForwardedToRef()

let fact = pint32 <|> between (str "(") (str ")") expr
let term =   chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))

let parse str = run expr str

Parser #2 (Idiomatic FParsec implementation):

open FParsec

let opp = new OperatorPrecedenceParser<_,_,_>()
type Assoc = Associativity

let str s = pstring s
let noWS = preturn () // dummy whitespace parser

opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))

let expr = opp.ExpressionParser
let term = pint32 <|> between (str "(") (str ")") expr
opp.TermParser <- term

let parse str = run expr str
Stephan Tolksdorf
  • Very impressive, but your use of `CharParsers` makes me think you've dropped support for Unicode. Dropping Unicode from my hand-rolled parser makes it another 8× faster than before, which is still ~4× faster than your new version of FParsec. That's more in line with my recollection of the benchmarks I did years ago though... – J D Dec 31 '10 at 00:56
  • No, FParsec parsers are Unicode parsers. If you see 8× better performance for the same code when you switch from a string/char array to a byte array, there must be some compiler issue (either in the F# compiler or the CLR's JIT). Even if the parser was memory-bound, it would be hard to explain a difference much larger than 2×. – Stephan Tolksdorf Dec 31 '10 at 01:07
  • Wow, that's incredible! When I moved from char lists to strings I had to rearrange the code, which meant fewer active patterns. That seems to have been partly responsible for the speedup. I also see a big speedup in the parser depending on what encoding was used to load the string... – J D Dec 31 '10 at 01:57

In a nutshell, parser combinators are slow for lexing.

There was a Haskell combinator library for building lexers (see "Lazy Lexing is Fast" by Manuel M. T. Chakravarty) - because the tables were generated at runtime, there was no code-generation hassle. The library got used a bit - it was initially used in one of the FFI preprocessors - but I don't think it ever got uploaded to Hackage, so maybe it was a little too inconvenient for regular use.

In the F# code above, the parser matches directly on char lists, so it can be as fast as list destructuring is in the host language (it would be much faster than Parsec if it were re-implemented in Haskell). Christian Lindig had an OCaml library with a set of parser combinators and a set of lexer combinators - the lexer combinators were certainly much simpler than Manuel Chakravarty's, and it might be worthwhile tracking down this library and benchmarking it before writing a lexer generator.
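As a toy illustration of that 'see a char, do an action' style (my own example, not taken from either of the libraries mentioned), here is an integer lexer in Haskell that destructures the character list directly:

import Data.Char (digitToInt, isDigit)

-- Accumulate digits by matching directly on the char list: no
-- backtracking and no position bookkeeping, just one step per character.
lexInt :: Int -> String -> (Int, String)
lexInt n (c:cs) | isDigit c = lexInt (10 * n + digitToInt c) cs
lexInt n cs                 = (n, cs)

For example, lexInt 0 "123+4" yields (123, "+4"). A combinator library pays for its generality (alternatives, error messages, position tracking) on every one of those characters.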

stephen tetley
  • Christian Lindig's lexer combinators took a bit of finding - they are the modules lc.ml and lc.mli in the Caml CDK. There are two versions of the parser combinators - pc.ml and pc2.ml. – stephen tetley Dec 30 '10 at 16:08
  • They have a warning at the top saying they aren't suitable for lexing long (>1k chars) strings. :-( – J D Jan 02 '11 at 00:01
  • Found the lexer combinator library mentioned above; it's in Mr Chakravarty's Compiler Toolkit (CTK), also in a subset of this kit called CTKLight. See http://www.cse.unsw.edu.au/~chak/haskell/ctk/ . Thanks for the reference! – Owen S. Apr 08 '13 at 21:00

Have you tried one of the known fast parser libraries? Parsec's aims have never really been speed, but ease of use and clarity. Comparing to something like attoparsec would be a fairer comparison, especially because the string types are likely to be more equal (ByteString instead of String).

I also wonder which compile flags were used. This being another trolling post by the infamous Jon Harrop, it would not surprise me if no optimisations were used at all for the Haskell code.

Axman6
  • Good point. For instance, if you'd used the non-overloaded `(<|>)` from `Text.Parsec`, it's a little bit faster. And a hand-made parser is always faster than one made of combinators. – fuz Dec 30 '10 at 05:27
  • "it would not surprise me if no optimisations were used at all for the Haskell code". I used `--make -O2` for the Haskell with GHC 6.12.3 which, I believe, is conventional optimized Haskell. You should be able to reproduce the results easily enough to verify this for yourself. – J D Jan 02 '11 at 00:09
  • @Axman6 I don't see why you'd think Jon's trolling — in any case, this thread's been an entertaining and educational read for me. – Erik Kaplun May 29 '15 at 16:47