The fundamental reason why regex and HTML don't mix? The theory behind it?

Question

To start with, I cannot do anything but refer to what I believe is the most famous SO post ever:

RegEx match open tags except XHTML self-contained tags

Now, is it even a question for StackOverflow? I don't know, but I'll try...

I'll speak from a personal point of view. While I've never had to do that, I know that the day I have to parse HTML, I will certainly not go with regexes; I'll try and find an HTML parsing library. Fine.

But I don't know why.

At one point, I decided to do CSS validation in Java. I knew "by the guts" that regexes wouldn't cut it, so I used Parboiled.

And I don't know why.

The "why" troubles me. I am no newbie with regexes at all. I just can't put a clear line between what regex engines can, and cannot do.

My question is the following: what is this clear line? What fundamental characteristic must an input have for it to be mathematically provable that no regex engine can reliably determine success or failure?

Can you give a simple, theoretical input which would spell failure as to a regex engine's ability to give a reliable "match/no match" answer? If yes, what is the defining characteristic of such an input?

EDIT For the sake of this discussion, I'll add a task suggested by a post on SO (which I can't find the link to at the moment, sorry) which is simpler than HTML, but for which I won't use regexes: shell command line parsing.

As far as the shell is concerned, those are equivalent:

alias ll="ls -l"
alias ll=ls\ -l
alias l"l"=ls' -'l
"alia"s l"l= "ls\ -l

Shell quoting mechanisms are so numerous that I'll just create a Parboiled grammar in this case... But this is "out of my guts". Because I find it easier probably... But that doesn't prove that this is not feasible with regexes.
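
As a rough illustration of what a dedicated tokenizer does with these lines, here is a minimal sketch using Python's standard `shlex` module. It is an assumption that `shlex` is close enough to the shell for this purpose: it only approximates POSIX quoting rules, and it is not the Parboiled grammar mentioned above. The name `COMMANDS` is made up for the example.

```python
import shlex

# The four command lines from the question, kept verbatim in a raw string
# so that backslashes and quotes reach shlex untouched.
COMMANDS = r'''
alias ll="ls -l"
alias ll=ls\ -l
alias l"l"=ls' -'l
"alia"s l"l= "ls\ -l
'''.strip().splitlines()

# shlex.split walks the input character by character, tracking whether it
# is inside single quotes, inside double quotes, or right after a backslash.
tokens = [shlex.split(command) for command in COMMANDS]

for command, words in zip(COMMANDS, tokens):
    print(f"{command!s:24} -> {words}")
```

The first three lines tokenize to the same two words, `alias` and `ll=ls -l`. The fourth, as written above, has a space inside its quoted span after the `=`, so `shlex` reports `ll= ls -l` for it.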

The simple answer is: regular expressions, in the formal sense, are not capable of parsing texts that may contain arbitrary levels of nesting. In HTML/XML, you can have arbitrary nesting of tags. So if you are parsing a page whose template never ever changes, you can use regex to extract a fairly static portion of it, but to account for different levels of nesting, you have to use another solution. — Anorov, Jun 11 '13 at 22:48
@Anorov see my comment on the given answer. With PCRE and .NET regular expressions you *can* parse nested structures. — Martin Ender, Jun 11 '13 at 22:49
@m.buettner Note I said "regular expressions, *in the formal sense*". — Anorov, Jun 11 '13 at 22:52
See [this answer](http://stackoverflow.com/a/4843579/471272) and especially [this one](http://stackoverflow.com/a/4234491/471272), where you will learn that: ⓵ Modern pattern-matching engines have no trouble with arbitrary nesting. ⓶ Just because you ***can*** do something doesn’t mean that you ***should***. — tchrist, Jun 11 '13 at 23:16
Since you said "While I've never had to do that", one way to find out is to try parsing HTML with regex yourself. — doubleDown, Jun 12 '13 at 15:01

recursive · Answer 1 · 2013-06-13T14:17:38.247

6

Regular expressions, in the formal sense, recognize exactly the regular languages. But HTML is not a regular language. It is a context-free language, and the context-free languages are a strict superset of the regular languages.

Basically, any language that allows arbitrarily deep nesting of its elements is not regular. A regular language is recognized by a finite automaton, which has only a fixed number of states and so cannot keep count of how many elements are still open. Regular languages have to be "flat", so there can be no unbounded nesting. In HTML, for example, one <div> can be nested inside another, and there is no limit to the depth at which they can be nested. It is that type of general nesting that regular expressions can't deal with.
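
A minimal sketch of this point in Python, using balanced parentheses as a stand-in for nested tags. The function names are hypothetical, and Python's built-in `re` module is used because it has no recursion feature. A pattern without recursion has to be built for a fixed maximum depth, while a simple counter handles any depth:

```python
import re


def depth_limited_pattern(max_depth: int) -> re.Pattern:
    """Build a recursion-free regex for nested parentheses up to max_depth levels."""
    inner = ""  # depth 0: no parentheses allowed
    for _ in range(max_depth):
        # Wrap the previous level in one more pair of parentheses.
        inner = rf"(?:\({inner}\))*"
    return re.compile(inner)


def balanced(text: str) -> bool:
    """Check balanced parentheses at any depth by counting open ones."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closer with nothing open
                return False
        else:
            return False   # only parentheses are allowed in this toy language
    return depth == 0


two_levels = depth_limited_pattern(2)
print(two_levels.fullmatch("(()())") is not None)  # nesting depth 2
print(two_levels.fullmatch("((()))") is not None)  # nesting depth 3
print(balanced("((()))"))
```

The pattern for two levels accepts the first string and rejects the second. Raising `max_depth` only moves the limit, and the pattern grows with it, whereas the counter needs no change.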

edited Jun 13 '13 at 14:17

answered Jun 11 '13 at 22:41

recursive


Note that while there is the `html` tag (maybe a mistake?), I do not refer to HTML specifically; also, some regex engines allow recursion (for instance, PCRE and `(?R)`); even then, some inputs elude such engines' capabilities to determine success and failure. And this is where I am at a loss. – fge Jun 11 '13 at 22:45
I agree that this answer is a simplification. The regular expressions implemented in all of the popular programming languages match way more than regular languages. It starts with backreferences (which cannot help with nested structures though) and ends with recursion in PCRE (as fge mentioned) and balancing groups in .NET. Apparently (according to a comment on the OP's linked topic), PCRE is even Turing complete. The problem is certainly rather the tons of valid syntax variations (not even speaking of invalid HTML). – Martin Ender Jun 11 '13 at 22:48
@recursive certainly, but the OP's question is where to draw the line and why. And I think that's a fair question. – Martin Ender Jun 11 '13 at 22:57
This is misleading. There's an ambiguity of terminology. Originally the term "regex" had a specific meaning. Programming languages invented regex engines. But these engines grew in power (backreferences, lookarounds, recursive patterns, ...). It's no longer true that the original "regular languages" are all that's recognisable. E.g. `/^(a*)b\1$/` recognises strings `b`, `aba`, `aabaa`, `aaabaaa`, ... ; an irregular language. I think you can write a regex to test HTML validity if you really want. (You don't.) In practice, there'll be a library for it, but regexes may be easier in your case. – David Knipe Jun 11 '13 at 23:06
@DavidKnipe I'd appreciate if you could make an answer out of this comment, if you please? – fge Jun 11 '13 at 23:31
Note: HTML is *not* a grammar, it's a language. You can write a grammar to generate a parser for this language however. – Mike Lischke Jun 13 '13 at 11:46
-1 As already explained in these comments and in the other answers, modern regexes are not "regular" in the academic sense and are fully capable of matching nested structures (which is covered in Friedl's [Mastering Regular Expressions](http://shop.oreilly.com/product/9781565922570.do)). Recursion is supported in at least [PHP](http://php.net/manual/en/regexp.reference.recursive.php) and [.NET](http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition) – JDB still remembers Monica Jun 13 '13 at 21:16

score 3 · Answer 2 · answered Jun 13 '13 at 11:54

Regular expressions are mostly used to match a given pattern against an input string and see if that succeeds. That's their primary goal. RE libraries offer additional features like getting subparts of an input string based on the match, but that's feasible only for a few parts. If you are going to need a full representation of your input, you need a parse tree. Every parser can easily generate this for you, since this is one of its tasks. With RE you have to do this manually.

Another point is the complexity of the expression you would need if you used regular expressions. It is difficult to test for errors, and you mostly get all or nothing: either it matches successfully (and you get your desired info), or you get nothing and have to work out what is wrong with it. Using a parser generator you can interactively build your grammar to get more and more info, not to mention that you can probably find an existing HTML grammar for every relevant parser generator out there.

Finally, don't forget feedback for invalid input. With RE you get nothing. With a parser you get error messages that point you to the actual problem. Some parsers (like those generated by ANTLR) can even cope with simple syntax errors and still generate a usable parse tree for you.
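
To make the parse tree and error-message points concrete, here is a minimal hand-written sketch in Python for a toy language of nested `<name>...</name>` tags. It is not HTML and not what ANTLR generates; `TOKEN`, `ParseError` and `parse` are hypothetical names chosen for the example.

```python
import re

# A regex is still handy for recognising one token at a time.
TOKEN = re.compile(r"<(/?)([a-z]+)>")


class ParseError(Exception):
    """Raised with a message that says what was expected and where."""


def parse(text: str) -> list:
    """Parse nested tags into a tree of (name, children) tuples."""
    root: list = []
    stack = [("", root)]  # (name of the open tag, its list of children)
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ParseError(f"unexpected character {text[pos]!r} at offset {pos}")
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            children: list = []
            stack[-1][1].append((name, children))  # attach to current parent
            stack.append((name, children))         # descend into the new tag
        else:
            open_name = stack[-1][0]
            if open_name != name:
                expected = f"</{open_name}>" if open_name else "no closing tag"
                raise ParseError(
                    f"expected {expected} but found </{name}> at offset {pos}"
                )
            stack.pop()
        pos = match.end()
    if len(stack) > 1:
        raise ParseError(f"unclosed <{stack[-1][0]}> at end of input")
    return root


print(parse("<a><b></b><c></c></a>"))
try:
    parse("<a><b></a>")
except ParseError as err:
    print(err)
```

The first call prints the tree `[('a', [('b', []), ('c', [])])]`, and the second reports `expected </b> but found </a> at offset 6`. The regex only recognises individual tokens here; the stack is what tracks the nesting.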

score 1 · Answer 3 · answered Jun 13 '13 at 20:32

You say you've heard that regexes can't parse HTML. This is misleading: there's an ambiguity of terminology.

Originally the term "regex" had a specific, mathematical meaning. Programming languages then built regex engines on that theory, and over time these engines grew in power (backreferences, lookarounds, recursive patterns, ...). It's no longer true that the original "regular languages" are the only languages recognisable by regex engines.

For example, /^(a*)b\1$/ recognises strings b, aba, aabaa, aaabaaa, etc; this is not a regular language.
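
A quick way to check this, sketched with Python's `re` module (an assumption made for illustration; the answer does not name an engine, but the `\1` backreference syntax is the same). The helper name `recognised` is made up.

```python
import re

# fullmatch anchors the pattern at both ends, so ^ and $ are not needed.
pattern = re.compile(r"(a*)b\1")


def recognised(text: str) -> bool:
    """Return True if text is some run of a's, a b, then the same run of a's."""
    return pattern.fullmatch(text) is not None


for candidate in ["b", "aba", "aabaa", "aab", "abaa"]:
    print(candidate, recognised(candidate))
```

The first three candidates are accepted and the last two are rejected. A finite automaton cannot do this: it would need a separate state for every possible count of leading `a` characters, which is why the language is not regular.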

I think you can write a regex to test HTML validity if you really want to. (You don't.) In practice, there'll be a library for it in whatever language you're using, but regexes may be easier, depending on your use case.

score 0 · Answer 4 · answered Jun 12 '13 at 19:38

I think the best answer you can get here is the old adage: "When all you have is a hammer, the whole world looks like a nail." Regular expressions can do pretty near anything. Their power is in their ability to work with any string. However, just because you can use something doesn't mean you should. Regular expressions are slow, and largely inefficient (you can optimize them in many ways, but very few people know those techniques and even fewer actually take the time to implement them and to thoroughly test and check their regular expressions).

In the case of HTML, there are better tools out there: tools that are faster than regular expressions and more suited to working with HTML (capable of building node trees, etc). It's not so much that you shouldn't use regular expressions to parse HTML, it's that there are better tools. Why would you try to saw a tree with a butter knife when you can use a chain-saw?
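
As one example of such a tool, here is a minimal sketch using `html.parser.HTMLParser` from Python's standard library. It is an event-based parser rather than a tree builder, so the nesting depth is tracked by hand; the class name `TagOutline` is made up for the example.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record (depth, tag) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed('<div class="outer"><div><p>hi</p></div></div>')
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

This sketch assumes every start tag has a matching end tag. Void elements such as `<br>` have none, so the depth counter would drift on real pages; a full tree-building library takes care of that kind of detail.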
