
Possible Duplicate:
If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

My question is simple: How do current DOM parsers actually parse the DOM from a string (XML, HTML, or otherwise)?

I know you shouldn't parse HTML with RegEx, but couldn't a DOM parser use RegEx to match the patterns for opening and closing tags? Or is there a good single-pass algorithm for parsing the provided string as a character array?

zzzzBov
  • Depends on the parser implementation, doesn't it? – Ed S. Jan 09 '11 at 07:00
  • But to answer this exact question quickly: most probably do use regexes, but only **for tokenization** (e.g. recognizing opening and closing tags). –  Jan 09 '11 at 07:04
  • I missed that question somehow, and I've voted to close this copy down. – zzzzBov Jan 09 '11 at 07:08
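
To illustrate the tokenization point raised in the comments, here is a minimal sketch of a regex being used only to split markup into tokens, not to understand nesting. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical names made up for this example, and the pattern covers only a simplified subset of markup (no comments, doctypes, or `>` inside attribute values). It is not how any particular real parser is implemented.

```python
import re

# Hypothetical token pattern (an assumption for this sketch):
# a closing tag, an opening/self-closing tag, or a run of plain text.
TOKEN_RE = re.compile(
    r"</(?P<close>[A-Za-z][\w-]*)\s*>"
    r"|<(?P<open>[A-Za-z][\w-]*)(?P<attrs>[^<>]*?)(?P<selfclose>/?)>"
    r"|(?P<text>[^<]+)"
)


def tokenize(markup):
    """Yield (kind, value) pairs for a simplified subset of markup."""
    pos = 0
    while pos < len(markup):
        m = TOKEN_RE.match(markup, pos)
        if m is None:
            # A stray '<' that does not start a recognised tag: emit it as text.
            yield ("text", markup[pos])
            pos += 1
            continue
        if m.group("close"):
            yield ("close", m.group("close"))
        elif m.group("open"):
            kind = "selfclose" if m.group("selfclose") else "open"
            yield (kind, m.group("open"))
        else:
            yield ("text", m.group("text"))
        pos = m.end()


if __name__ == "__main__":
    for token in tokenize("<p>Hi <b>there</b><br/></p>"):
        print(token)
```

The regex only recognises individual tokens. Matching each closing tag to its opening tag needs something with memory, such as a stack, which is the part a regular expression alone cannot do.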

2 Answers


Look at this:

[image not available]

Here is a good example.

Naveed

Well, you could start with a basic approach along the lines of:

http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c

And then just expand it to store everything into the full DOM tree structure.
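
As a rough idea of what "expanding it into a tree" could look like, here is a minimal sketch of a single pass over the characters that keeps a stack of open elements. It is written in Python rather than the language of the linked article, and the names `Node` and `parse` are hypothetical. It assumes well-behaved input and ignores attributes, comments, void elements, and a `>` inside attribute values, so treat it as an illustration of the stack idea rather than a real DOM parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a tag name plus child nodes or text strings."""
    tag: str
    children: list = field(default_factory=list)


def parse(markup):
    """Build a simple tree in one left-to-right pass over the string."""
    root = Node("#document")
    stack = [root]              # stack[-1] is the element currently open
    i, n = 0, len(markup)
    while i < n:
        if markup[i] == "<":
            end = markup.find(">", i)
            if end == -1:
                # Unterminated tag: keep the remainder as text and stop.
                stack[-1].children.append(markup[i:])
                break
            inner = markup[i + 1:end].strip()
            if inner.startswith("/"):
                # Closing tag: pop back to the nearest matching open element.
                name = inner[1:].strip()
                for depth in range(len(stack) - 1, 0, -1):
                    if stack[depth].tag == name:
                        del stack[depth:]
                        break
            elif inner:
                # Opening tag: attributes are skipped in this sketch.
                self_closing = inner.endswith("/")
                name = inner.rstrip("/").split()[0]
                node = Node(name)
                stack[-1].children.append(node)
                if not self_closing:
                    stack.append(node)
            i = end + 1
        else:
            # Text run: everything up to the next '<' (or end of input).
            end = markup.find("<", i)
            if end == -1:
                end = n
            stack[-1].children.append(markup[i:end])
            i = end
    return root


if __name__ == "__main__":
    doc = parse("<ul><li>one</li><li>two</li></ul>")
    print(doc)
```

The stack is what lets each closing tag be attached to the right parent. An unmatched closing tag is simply ignored here, whereas real HTML parsers apply much more detailed recovery rules.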

Jonathan Wood