5

Is it stupid to build a regex-based parser?

redDragonzz
  • 11
    ---- it depends ---- – Mauritz Hansen Mar 22 '11 at 09:42
  • You appear to ask two different questions: are you building a language parser and wish to base your parser on a pile of regex? Or are you trying to implement a competitor to [pcre](http://www.pcre.org/) or [Oniguruma](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt) regular expression parsers? – sarnold Mar 22 '11 at 09:45
  • Is this question related to [VBScript Partial Parser](http://stackoverflow.com/questions/5192774/vbscript-partial-parser)? Perhaps you can join forces with [Sarah Vessels](http://stackoverflow.com/questions/5382088). – Kobi Mar 22 '11 at 10:28
  • Yes, it is stupid. But it is ok to build a regex-based lexer. – SK-logic Mar 22 '11 at 10:28
  • @sarnold: I am trying to build my own parser based on a pile of regexes. @kobi: Yup, after having a hard time trying to implement a partial parser for VBScript, I thought I would give regex a try. – redDragonzz Mar 22 '11 at 10:44
  • 2
    @redDragonzz, in some sense, PEGs (http://en.wikipedia.org/wiki/Parsing_expression_grammar) can be considered as a generalisation of regular expressions. So if you prefer this way of thinking, you can try one of the existing PEG parser generators to build your parser - it is much easier and much more intuitive than a classic lexer + parser combination. – SK-logic Mar 22 '11 at 10:51
  • As @MauritzHansen said, it depends, primarily on the language you want to parse and on the capabilities of your regular expression library. – Gumbo Mar 03 '12 at 09:38
  • Yes, I ended up making my own regex parsing engine for VBScript. It worked great in quite a number of scenarios, but for some, especially nested declarations, it doesn't work the way it is supposed to. – redDragonzz Mar 04 '12 at 13:21

6 Answers

16

Matching nested parens is exceedingly simple using modern patterns. Not counting whitespace, this sort of thing:

    \( (?: [^()] *+ | (?0) )* \)

works for mainstream languages like Perl and PHP, plus anything that uses PCRE.

However, you really need grammatical regexes for a full parse, or you’ll go nuts. Don’t use a language whose regexes can’t be broken down into smaller units, or that doesn’t support proper debugging of their compilation and execution. Life’s too short for low-level hackery. Might as well go back to assembly language if you’re going to do that.

I’ve written about recursive patterns, grammatical patterns, and parsing quite a bit: for example, see here for parsing approaches and here for lexer approaches; also, the final solution here.

Also, Perl’s Regexp::Grammars module is especially useful in turning grammatical regexes into parsing structures.

So by all means, go for it. You’ll learn a lot that way.

Ken Williams
tchrist
  • 1
    +1 Wish I could give you +100! The excellent cited references are now bookmarked. – ridgerunner Mar 22 '11 at 19:53
  • Although you can use RegExp for parsing to some extent, it remains a stupid idea (you can't build a syntax tree or make any sense of the input ...). See also http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Noctisdark Mar 11 '16 at 18:32
8

For work? Yes. For learning? No.

Matt
1

The allure of parsing your own little languages with regular expressions cannot be overstated: most sysadmins could write a simple language parser entirely in Perl very quickly, but parsing the same language with lex/yacc would take most programmers a few hours.

And the Perl version would probably just about do the job. But as gpvos points out, using a regex backend for your parsing drastically reduces future enhancement options, and attempts to work around the limitations sometimes lead to some pretty awful code, when it would be easy to handle those enhancements with table-driven tools or hand-written recursive descent parsers.

If you know the language is always going to remain easily parse-able with regex, you might do the right thing by spending an hour to get the job done, rather than four or five re-learning lex and yacc enough to write a similar parser with stronger tools. But if the language is liable to grow or change much, using real parser generators will probably help in the long run.

sarnold
1

It depends on what you want to parse, but IMO for most practical cases the answer is "No". Regexes are quite limited in the grammars they can recognize (the limits are set by the regex implementation, as everybody puts their own spin on it).

As you stated in your comments that you're building a parser for VBScript, forget about regexes, as you need to recognize a context-free grammar. Check out GOLD Parser or ANTLR.

Soronthar
  • I used GOLD Parser. It parses the code but is very slow, whereas I want to implement something very fast and asynchronous that does not take up a huge chunk of time. I could always optimize my usage of GOLD Parser. It looks like regexes might not help to parse everything, but they still help to find function declarations etc., which is what I need more than parsing an entire source file using GOLD Parser. – redDragonzz Mar 23 '11 at 05:40
0

Often, regexes are used for the lexer (the recognizing of tokens), and something more powerful such as a recursive descent parser is used for recognizing the sequences of tokens, i.e., the actual parsing.

For very simple languages, a regex could be enough, but you would be limiting yourself very much. For example, you cannot parse an expression like (1 + 2) * 3 - 4 using a regex.
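The split described above can be sketched roughly as follows. This is only an illustration, not code from the answer: the names `TOKEN_RE`, `tokenize`, `Parser`, and `evaluate` are made up for the example, and the grammar is assumed to cover only non-negative integers, `+ - * /`, and parentheses. A regex handles the tokens, and a small recursive descent parser handles the nesting.

```python
import re

# Hypothetical token pattern: an integer or a single operator/parenthesis,
# optionally preceded by whitespace.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Regex-based lexer: turn the input string into a list of token strings."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Recursive descent parser that evaluates the expression while parsing."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.take(), self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op, rhs = self.take(), self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # factor := NUMBER | '(' expr ')'
        token = self.take()
        if token == "(":
            value = self.expr()  # recursion handles nesting to any depth
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this sketch, `evaluate("(1 + 2) * 3 - 4")` walks the grammar and yields 5. The recursion in `factor` is what copes with arbitrarily deep parentheses, and a non-recursive regex has nothing equivalent.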

gpvos
  • 3
    Why is it not possible to parse **(1 + 2) * 3 - 4** with a regexp? – Stephan Mar 22 '11 at 09:47
  • 1
    Ruby's new Oniguruma regex engine provides 'named groups', which _could_ be used for finding properly nested parentheses, and so on. But I think an RD parser would be easier to write and maintain for full arithmetic... – sarnold Mar 22 '11 at 10:03
  • @Stephan: well, you could parse that specific expression, but not generic expressions that use parentheses to any depth. And even if you created a regex to match expressions with up to, say, four nested parentheses, it would be much more complex than a more appropriate implementation. – gpvos Mar 22 '11 at 10:12
  • 2
    You’re overstating matters. See my answer. – tchrist Mar 22 '11 at 18:51
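
To illustrate the point in the comments about a fixed nesting depth, here is a rough sketch that builds a pattern for Python's standard `re` module, which has no recursive patterns. The helper name `nested_parens_pattern` is made up for this example, and `max_depth` is assumed to be at least 1. Each extra level wraps the previous pattern, so the regex grows with every level it has to support.

```python
import re


def nested_parens_pattern(max_depth):
    """Build a non-recursive pattern for balanced parens up to max_depth levels."""
    # Depth 1: a pair of parentheses containing no other parentheses.
    pattern = r"\([^()]*\)"
    for _ in range(max_depth - 1):
        # Each additional level embeds the whole previous pattern.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


two_levels = re.compile(nested_parens_pattern(2))
print(bool(two_levels.fullmatch("(a(b)c)")))    # within the supported depth
print(bool(two_levels.fullmatch("(a(b(c)))")))  # one level too deep
print(len(nested_parens_pattern(1)), len(nested_parens_pattern(4)))
```

An engine with recursion, such as the PCRE pattern in tchrist's answer, avoids this growth because the pattern can refer to itself.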
0

Have a look at GOLD Parser. It allows the use of regular expressions for finding the tokens.

Stephan
  • Take a look at the extensive answer on parsing HTML using regular expressions. GOLD Parser may not be able to generate an HTML parser, on the grounds that HTML and XHTML are both more complex languages than those handled by regular expressions. – mozillanerd Nov 12 '11 at 23:47