What is the optimal regex for parsing HTML (even though you shouldn't)? Is there a perfect one?

Question

Okay, we all know attempting to parse HTML with Regex brings upon the wrath of Cthulhu. Quite well. And there are some great responses as to why you shouldn't. I accept these, and have posted these links on questions more than once.

But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter. But assume for the moment our developers want to lose their minds to Tony the Pony and take the best shot at doing the impossible. If this blows your mind, assume the question to be theoretical then. Whatever floats your boat. Just consider the idea of parsing HTML with regex, even though you shouldn't.

Here we see a claim that it is not possible to do, at least with perfection. But then there's a very wise comment beneath it from @NikiC:

This answer draws the right conclusion ("It's a bad idea to parse HTML with Regex") from wrong arguments ("Because HTML isn't a regular language"). The thing that most people nowadays mean when they say "regex" (PCRE) is well capable not only of parsing context-free grammars (that's trivial actually), but also of context-sensitive grammars (see https://stackoverflow.com/a/7434814/1222420)

Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose. But many make this problem sound like the Halting Problem: you can try, but there will always be another case for which your solution breaks.

So here's the question, and it's a bit of a two-parter.

Is it possible to generate a perfect regular expression for parsing HTML?
- If so, is the proof constructive? Do we only know we can, or has it been done?
If it is not possible, what is the most accurate one out there?

The problem with HTML pages on the web is mostly the lack of verification. If the HTML is verified, then at least we can create a regular expression for that specific HTML edition. If you are asking about a specific verified HTML (like XHTML 1.0), nothing can stop us from doing it :) — Hawili, Aug 21 '12 at 20:21
@Hawili I'd happily settle for XHTML. There are just a lot of people out there throwing around the word "impossible". There was even a website themed "bring me your regexes, I will break them with my HTML". And I've seen some clever examples of how to, such as the classic `
` — Nick, Aug 21 '12 at 20:23
It depends on your definitions of "HTML" and "parse". Do you consider just a specific standard? Do you want to syntactically analyse embedded CSS/Javascript/etc (i.e. not just match something within — Lars Kotthoff, Aug 21 '12 at 20:26
If someone is going to do this, I think more people would be interested in HTML5. — slackwing, Aug 21 '12 at 20:27
In the first link you gave, a few regex-only parsers are mentioned (especially Sam's post). But even if you have no HTML parser, you can always take the regex and write your own parser. — BeniBela, Aug 21 '12 at 20:28
@ngmiceli: It's not a claim that regex are not capable of parsing HTML, it's a fact. Regex can describe type-3 Chomsky languages (regular languages), while HTML (and most other programming/markup languages) is a type-2 Chomsky language (context-free). The former is a subset of the latter. Therefore regex can only describe a subset of HTML. Big-foot is a claim. This is a fact. While PCRE is capable of much more than pure regex, by using its advanced features you'd simply be writing some kind of a context-free parser, which too could only parse valid HTML (and would thus require X/HTMLtidy preprocessing). — Regexident, Aug 21 '12 at 20:30
Yes, you can write a regular expression to parse HTML, assuming that all HTML documents are bounded by some upper length. If you extend that to PCRE, I believe you can parse HTML of arbitrary length, but I'm not certain. — Snowball, Aug 21 '12 at 21:32
@Snowball: No, you cannot. That is, not without PCRE's features, which as I said are beyond the scope of true regex. Being strictly left- or right-linear, regex have NO understanding of balance or nesting (neither "S→aSa" nor the mixed linear "S→aA,A→Sb,S→ε"). Therefore you can NOT parse HTML. Document length is NOT important here. A grammar of type-n can always cover a SUBSET of a language of type-(n-x), but NEVER ALL of it. But as soon as you reduce the scope of what you can recognize of language "X" you're NO LONGER recognizing language "X", but a NEW and self-contained subset language "Y". — Regexident, Aug 21 '12 at 22:56
@Regexident: I'm pretty sure my statement was correct. If you know the alphabet and an upper bound on the length, you could make a massive regex that enumerates all the possible HTML documents less than or equal to that length. Please correct me if I'm wrong, though. — Snowball, Aug 22 '12 at 02:27
@Snowball: This is in fact correct. Ironically I unknowingly mentioned basically the same thing in my answer. However, while it is technically a correct regex, it's basically just a hard-coded dict of all the words allowed in a given language of max length n. Problem is that while a language's grammar might be utterly simple, such as `^[0-9]{1,80}$`, the language itself gets huge quickly (here: 10^80 a.k.a. the [approximated number of atoms in the universe](http://www.wolframalpha.com/input/?i=atoms+universe)). So while technically absolutely correct, it's rather useless in real life. :) — Regexident, Aug 22 '12 at 02:52
Good question. Read [Mastering Regular Expressions 3rd Edition](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "Best regex book ever"). The end of the last chapter (on the PHP/PCRE flavor) covers advanced recursive matching of nested structures with examples validating HTML and XML. Yes, it can't be done but, well, actually it _can_ be done! (I do grow very tired of the (mostly ignorant) anti-regex sentiment here at SO.) — ridgerunner, Aug 22 '12 at 04:24
I'm not fully sure why this was closed. I was looking for if it is possible. That's a fact, not subjective discussion. The answer provided that. Seemed pretty solid to me — Nick, Aug 22 '12 at 12:16

Regexident · Accepted Answer · 2012-08-22T00:51:57.267

First of all let's get this straight:

Regex' incompatibility with HTML parsing is NOT a claim. Repeat after me: "Not a claim".

It's a scientifically proven and well-known fact. Furthermore, the world was not created in 7 days and big-foot ain't real either. End of discussion.

But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter

Funny that you write it doesn't matter, given that the "why" is actually what makes what you're planning either partially possible or completely impossible. If there was one thing here that mattered, it'd be the "why".

If the "why" is "validation", then the answer is by definition: not possible. Validation requires no less than 100% language coverage. And regular expressions, being a subset of context-free grammars, therefore cannot cover 100%. By definition.

If the "why" however is "extraction", then you can get quite good results using regex. Never 100% reliable, but good enough for most cases.

Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose.

The sheer length, redundancy and complexity of this pattern show that while it may not be impossible to describe valid email addresses in regex, it is at least disproportionately difficult, and the result resembles a brute-force dictionary list rather than a clean grammar. And while we're at it: date string validation is even worse. Leap years, just to begin with.

To put my differentiation between "validation" and "extraction" into perspective:

To validate a simple email address one needs a monolithic regular expression more than 6400 characters long.

To "extract" the domain name from an email address, however, the simple `@([^\s]+)` or `(?<=@)[^\s]+` would cover pretty much (if not exactly) 100%, assuming the string is isolated and known to be a valid email address.
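A minimal sketch of the extraction case in Python, assuming (as stated above) that the input is an isolated, already-valid address. The helper name `extract_domain` and the sample address are made up for illustration:

```python
import re
from typing import Optional

# Lookbehind variant: match everything after "@" up to the next whitespace.
DOMAIN_LOOKBEHIND = re.compile(r"(?<=@)[^\s]+")
# Capture-group variant of the same idea.
DOMAIN_GROUP = re.compile(r"@([^\s]+)")

def extract_domain(address: str) -> Optional[str]:
    """Return the part after '@', or None if there is no '@' at all."""
    match = DOMAIN_LOOKBEHIND.search(address)
    return match.group(0) if match else None

domain = extract_domain("alice@example.com")
same_domain = DOMAIN_GROUP.search("alice@example.com").group(1)
```

Neither pattern checks that the address is valid; they only pull a piece out of a string that is already trusted, which is exactly the difference between extraction and validation.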

Is it possible to generate a perfect regular expression for parsing HTML?

You basically answered this one yourself by writing "perfect": No.

If so, is the proof constructive? Do we only know we can, or has it been done?

It's not about "is it just that nobody has managed to do it yet?" but more about "it's been mathematically proven to be impossible!". QED

If it is not possible, what is the most accurate one out there?

Given that it's by definition not possible the only correct answer to this would be "none".

The best approximation of a regex for parsing all (or as much as possible) of HTML would be an infinitely long regex pattern along the lines of x|y|z|… with x, y, z … being all (brute-forced) possible productions of the grammar of HTML chained together in an infinitely long logical OR. It would be a proper regex (even in the truest sense of regex), cover all of HTML (it lists and hence matches all possible strings, after all), be only theoretically possible (or at least feasible, just like the Turing machine) and practically utterly useless.

Regex can describe type-3 Chomsky languages (regular languages), while HTML (and most other programming/markup languages) is a type-2 Chomsky language (context-free). The regular languages are a subset of context-free languages. A grammar of type-n can always cover a subset of a language of type-(n-x), but never all of it. Therefore regex can only describe a subset of HTML. Big-foot is a claim. This is a fact.

Regular grammars are strictly left- or right-linear, so regex has NO understanding of balance or nesting (neither "S→aSa" nor the mixed linear "S→aA,A→Sb,S→ε"). Therefore you can NOT parse HTML.
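To make the x|y|z|… idea concrete, here is a toy sketch in Python. It is not HTML, only a hypothetical mini-language of properly nested, empty <div> pairs, and the names `nested` and `enumerated_pattern` are invented here. Capping the nesting depth makes the language finite, so it can be listed word by word:

```python
import re

def nested(depth: int) -> str:
    """Return `depth` opening <div> tags followed by `depth` closing tags."""
    return "<div>" * depth + "</div>" * depth

def enumerated_pattern(max_depth: int) -> "re.Pattern[str]":
    """Build x|y|z|... with one literal alternative per allowed word."""
    words = [re.escape(nested(d)) for d in range(1, max_depth + 1)]
    return re.compile("|".join(words))

pattern = enumerated_pattern(3)

ok_depth_2 = pattern.fullmatch(nested(2)) is not None            # listed, so accepted
ok_depth_4 = pattern.fullmatch(nested(4)) is not None            # not listed, so rejected
ok_unbalanced = pattern.fullmatch("<div><div></div>") is not None  # not listed, so rejected
```

The pattern is a perfectly regular expression, but it is a dictionary rather than a grammar: every extra level of nesting needs another alternative, and unbounded nesting would need infinitely many.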

A quick example for "S→aSa" (balanced nesting):

<div>
    <div>
        ...
    </div>
</div>

Yep, the very core of what's HTML/XML is incompatible with regex. Pretty damn bad position to begin with, isn't it? HTML parsing via regex is literally rotten in its core. Broken by design. Guaranteed to fail.
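A small Python illustration of how this bites in practice, using the standard `re` module (which has no recursion construct). The sample markup is made up; the point is only that neither the lazy nor the greedy quantifier pairs the tags correctly:

```python
import re

# Hypothetical sample: an outer <div> that contains an inner <div>.
nested_html = "<div><div>inner</div>tail</div>"

# Lazy matching stops at the FIRST closing tag, which belongs to the inner <div>.
lazy = re.compile(r"<div>(.*?)</div>", re.DOTALL)
lazy_body = lazy.search(nested_html).group(1)

# Greedy matching runs to the LAST closing tag, swallowing sibling elements.
siblings_html = "<div>a</div><div>b</div>"
greedy = re.compile(r"<div>(.*)</div>", re.DOTALL)
greedy_body = greedy.search(siblings_html).group(1)
```

Here `lazy_body` comes out as `<div>inner` (the outer element is cut short) and `greedy_body` as `a</div><div>b` (two elements are merged into one). Picking the "right" quantifier only moves the failure to a different input.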

And one for "S→aA,A→Sb,S→ε" (counting):

It is impossible to validate the correct (matching) number of <td> per row:

<table>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>

Another thing to keep in mind: as soon as you reduce the scope of what you can recognize of language "X" you're NO LONGER recognizing language "X", but a NEW and self-contained subset language "Y".

In the field of languages it is either all or nothing. There is no in between.

Now, to those saying "PCRE can do it!": Yes, it's called a context-free grammar then.

…by which it not only would no longer be a regular expression and thus fail the test:

we have no option other than Regex

but also still be the wrong tool to begin with. There are dedicated parsers for such tasks. Use 'em.
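For comparison, a sketch of the per-row <td> count done with a dedicated tokenizer, here Python's standard-library `html.parser`. The class `CellCounter` and the function `cells_per_row` are hypothetical names, and the sketch assumes there are no nested tables:

```python
from html.parser import HTMLParser
from typing import List

class CellCounter(HTMLParser):
    """Count the <td> start tags seen after each <tr> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[int] = []  # one entry per <tr>, holding its <td> count

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append(0)          # a new row starts with zero cells
        elif tag == "td" and self.rows:
            self.rows[-1] += 1           # attribute the cell to the current row

def cells_per_row(html: str) -> List[int]:
    counter = CellCounter()
    counter.feed(html)
    counter.close()
    return counter.rows

table = (
    "<table>"
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>"
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>"
    "</table>"
)
counts = cells_per_row(table)
rows_match = len(set(counts)) == 1  # True when every row has the same cell count
```

The counting lives in ordinary program state (`self.rows`) instead of being squeezed into a pattern, which is why it stays short and readable.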

The email matching regex (as linked by OP) is a nightmare to read, let alone maintain:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(
?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) ... (6400+ chars)

While here is an excerpt of the very same specification in form of a proper context-free grammar:

 address     =  mailbox                      ; one addressee
             /  group                        ; named list
 group       =  phrase ":" [#mailbox] ";"
 mailbox     =  addr-spec                    ; simple address
             /  phrase route-addr            ; name & addr-spec
 route-addr  =  "<" [route] addr-spec ">"
 route       =  1#("@" domain) ":"           ; path-relative
 addr-spec   =  local-part "@" domain        ; global address
 local-part  =  word *("." word)             ; uninterpreted
                                             ; case-preserved
 domain      =  sub-domain *("." sub-domain)
 sub-domain  =  domain-ref / domain-literal
 domain-ref  =  atom                         ; symbolic reference

Some people, when confronted with HTML, think "I know, I'll use regular expressions."
Now they have two metric f*cktons of problems.

So tell me. Why on earth would anyone really been far even as decided to use even go want to do look more like?

I dunno if upvotes count on answers to closed questions, but I gave you one, anyway, as this is a fantastic answer. — ebneter, Aug 22 '12 at 01:40
@ebneter: glad you liked it. Had quite some fun writing it. Luckily it was closed just minutes after I finally hit submit and not before. ;) — Regexident, Aug 22 '12 at 03:05
@Regexident I'm glad you had fun writing it. At first it looked as if this was a greatly tedious endeavor for you. Beautifully written answer, hit all the points, and is a great help to those of us not educated in formal grammar. — Nick, Aug 22 '12 at 12:12
It's also a proven fact that [Bumble-bees can't fly](http://stackoverflow.com/a/4934590/433790). — ridgerunner, Aug 22 '12 at 18:15
@ridgerunner: Regex are a subset of PCRE. PCRE's extended regex are context-free grammars (just as your link states), not regex. And while they might technically be fully capable of parsing type-2 languages, PCRE would be the wrong tool for a parsing complexity as found in the HTML specifications with all their variations and often unclear/ambiguous standards. Even if just for the illegibility compared to dedicated context-free parsers (see email validation). It is also absolutely possible to crack salted SHA512, but is it practical to attempt it? Being able to techn. do X doesn't mean one should. — Regexident, Aug 22 '12 at 18:52
Semantics! When discussing "regular expressions", I refer to the "regex" pattern matching functionality found in all modern languages, which is thoroughly covered in Friedl's (excellent) book. The more powerful of these tools neatly handle nested structures accurately, quickly and efficiently. The theoretical "[REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40)" expressions of which you speak are simply _not used any more_. The negative knee-jerk reactions that always come up here to any question daring to include the words REGEX and HTML are unjustified. — ridgerunner, Aug 22 '12 at 19:59
@ridgerunner: To say that "all modern languages" support recursion (being a basic requirement for parsing type-2) is quite a stretch if you ask me (.NET, std, Qt, Java, Cocoa, …). Anyway, there's no point in battling over implementation details, as regex unfortunately is an ambiguous term. My point was to educate about the limitations of regex, in particular as OP's understanding of what's commonly called regex appeared to be to some degree incomplete. Without knowing about Chomsky's hierarchy one is almost doomed to eventually fall into a trap with regex. My goal was to shine some light on it — Regexident, Aug 22 '12 at 20:43
I didn't say all modern regex engines support recursion (but the powerful Perl, PHP and .NET ones sure do). I said all modern regular expression engines are non-[REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40) (and haven't been for a long, long time). — ridgerunner, Aug 22 '12 at 23:35
The comment "Being able to techn. do X doesn't mean one should." seems like a strange and nonconstructive thing to say, as does the corresponding added section about PCRE. No one claimed *one should*, as far as I can see; the question's title explicitly says "(even though you shouldn't)"; how many times does that need to be said? — Don Hatch, Jan 24 '21 at 02:11
Maybe I'm slow (perhaps missing a joke?) but I read the final sentence "Why on earth would anyone really been far even as decided to use even go want to do look more like?" 10 times and I still haven't parsed it successfully. Maybe I need a regular expression? — Don Hatch, Jan 24 '21 at 02:18
