How is the DOM parsed?

Question

Possible Duplicate:
If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

My question is simple: How do current DOM parsers actually parse the DOM from a string (XML, HTML, or otherwise)?

I know you shouldn't parse HTML with RegEx, but couldn't a DOM parser use RegEx to match patterns for open/close tags? Or is there a good single-pass algorithm for parsing the provided string as a character array?

But to answer this exact question quickly: most probably do use regexes, but only **for tokenization** (e.g. recognizing opening and closing tags). — , Jan 09 '11 at 07:04
I missed that question somehow, and I've voted to close this copy down. — zzzzBov, Jan 09 '11 at 07:08

score 4 · Accepted Answer · edited May 23 '17 at 12:07

Look at this:

[image not available]

Here is a good example.

edited May 23 '17 at 12:07

Community

answered Jan 09 '11 at 07:00

Naveed

score 0 · Answer 2 · answered Jan 09 '11 at 07:07

Well, you could start with a basic approach along the lines of:

http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c

And then just expand it to store everything into the full DOM tree structure.
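To make that concrete, here is a rough sketch in Python (not the linked article's code) of how tag-level tokenization can be expanded into a tree: a regex splits the string into tag and text tokens, and a stack of open elements decides where each new node is attached. The names `TOKEN_RE`, `Node`, and `parse` are made up for this illustration, and the pattern is a simplifying assumption: it ignores comments, doctypes, script contents, `>` inside quoted attribute values, and the error recovery that real HTML parsers perform.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token pattern: either a tag (open, close, or self-closing)
# or a run of text up to the next "<". Attributes are matched but not parsed.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>|([^<]+)")


@dataclass
class Node:
    name: str                       # tag name, or "#text" for text nodes
    text: str = ""                  # only used by "#text" nodes
    children: list = field(default_factory=list)


def parse(markup: str) -> Node:
    """Build a simple tree from markup in one left-to-right pass."""
    root = Node("#document")
    stack = [root]                  # currently open elements, innermost last
    for match in TOKEN_RE.finditer(markup):
        closing, name, _attrs, self_closing, text = match.groups()
        if text is not None:
            # Text token: attach to the innermost open element.
            if text.strip():
                stack[-1].children.append(Node("#text", text.strip()))
        elif closing:
            # Close tag: pop back to the nearest matching open element.
            # An unmatched close tag is silently ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == name.lower():
                    del stack[i:]
                    break
        else:
            # Open tag: attach a new node, and keep it open unless "<x/>".
            node = Node(name.lower())
            stack[-1].children.append(node)
            if not self_closing:
                stack.append(node)
    return root


tree = parse("<div><p>Hello <b>world</b></p><br/></div>")
div = tree.children[0]
print([child.name for child in div.children])
```

The regex only recognizes tokens; the nesting is tracked by the stack, which is the part a regex alone cannot do. Production HTML parsers follow the same two-stage idea (tokenize, then build the tree) but use a hand-written state machine for tokenization rather than a single pattern.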

answered Jan 09 '11 at 07:07

Jonathan Wood

