44

For argument's sake, let's assume an HTML parser.

I've read that it tokenizes everything first, and then parses it.

What does tokenize mean?

Does the parser read each character in turn, building up a multidimensional array to store the structure?

For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute) the element is pushed onto an array stack somewhere?

I'm interested for the sake of knowing (I'm curious).

If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?

alex
  • Look at http://en.wikipedia.org/wiki/Lexical_parser for a very brief intro; also check out the `Parsing` article there. And HTML Purifier, at some point, does exactly that. – Piskvor left the building Jun 30 '10 at 14:40
  • The HTML Agility Pack is open source and is based on a tokenizer. http://htmlagilitypack.codeplex.com/ – Oded Jun 30 '10 at 14:43
  • If you can read C (ocaml, lisp), try looking at some tutorials on yacc/lex (ocamlyacc/ocamllex, cl-yacc/cl-lex...). You'll understand the basics quickly from the code. If you can read code. – Amadan Jun 30 '10 at 14:48
  • Well, regex is probably the simplest way of explaining tokenization, but in the case of HTML, it's not sufficient - I'd say it could be the first step, with additional processing required. – Piskvor left the building Jun 30 '10 at 15:07
  • Good question, I think. Understanding parsing is pretty fundamental to lots of coding tasks. – Mark Schultheiss Jun 30 '10 at 15:30
  • Really good article on this topic: http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/ – Jordan Brown Sep 09 '11 at 20:15

5 Answers

62

Tokenizing can be composed of a few steps. For example, if you have this HTML code:

<html>
    <head>
        <title>My HTML Page</title>
    </head>
    <body>
        <p style="special">
            This paragraph has special style
        </p>
        <p>
            This paragraph is not special
        </p>
    </body>
</html>

the tokenizer may convert that string to a flat list of significant tokens, discarding whitespace (thanks, SasQ, for the correction):

["<", "html", ">",
    "<", "head", ">",
        "<", "title", ">", "My HTML Page", "</", "title", ">",
    "</", "head", ">",
    "<", "body", ">",
        "<", "p", "style", "=", "\"", "special", "\"", ">",
            "This paragraph has special style",
        "</", "p", ">",
        "<", "p", ">",
            "This paragraph is not special",
        "</", "p", ">",
    "</", "body", ">",
"</", "html", ">"
]
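A first pass like this could be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions (double-quoted attribute values only, no comments, no entities, no error recovery), and the names `tokenize` and `WORD_RE` are made up for this example; it is not how a real HTML tokenizer is specified.

```python
import re

# Hypothetical rule: inside a tag, a "word" is any run of characters
# that is not whitespace, "=", a double quote, or ">".
WORD_RE = re.compile(r'[^\s=">]+')

def tokenize(html):
    """Split markup into a flat list of small tokens, dropping whitespace."""
    tokens = []
    pos, n = 0, len(html)
    while pos < n:
        if html[pos] == "<":
            # Inside a tag: emit fine-grained tokens until the closing ">".
            if html.startswith("</", pos):
                tokens.append("</")
                pos += 2
            else:
                tokens.append("<")
                pos += 1
            while pos < n and html[pos] != ">":
                ch = html[pos]
                if ch.isspace():
                    pos += 1
                elif ch == "=":
                    tokens.append("=")
                    pos += 1
                elif ch == '"':
                    # Quoted value; a missing closing quote is treated as
                    # if it sat at the end of the input.
                    end = html.find('"', pos + 1)
                    if end == -1:
                        end = n
                    tokens.extend(['"', html[pos + 1:end], '"'])
                    pos = end + 1
                else:
                    match = WORD_RE.match(html, pos)
                    tokens.append(match.group())
                    pos = match.end()
            if pos < n:
                tokens.append(">")
                pos += 1
        else:
            # Outside a tag: everything up to the next "<" is one text token.
            end = html.find("<", pos)
            if end == -1:
                end = n
            text = html[pos:end].strip()
            if text:
                tokens.append(text)
            pos = end
    return tokens
```

Calling `tokenize('<p style="special">Hi there</p>')` yields the same kind of flat list as shown above, with the text between the tags kept as a single token.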

There may be multiple tokenizing passes that convert a list of tokens to a list of even higher-level tokens, as the following hypothetical HTML parser might do (the result is still a flat list):

[("<html>", {}),
    ("<head>", {}),
        ("<title>", {}), "My HTML Page", "</title>",
    "</head>",
    ("<body>", {}),
        ("<p>", {"style": "special"}),
            "This paragraph has special style",
        "</p>",
        ("<p>", {}),
            "This paragraph is not special",
        "</p>",
    "</body>",
"</html>"
]
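A second pass of this kind might look like the sketch below. It assumes the input is a well-formed flat token list in the style shown earlier, with every attribute written as name, `=`, quote, value, quote; the function name `group_tags` is hypothetical.

```python
def group_tags(tokens):
    """Merge low-level tokens into tag-level tokens (still a flat list)."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "<":
            # Opening tag: "<", name, then zero or more attributes, then ">".
            name = tokens[i + 1]
            attrs = {}
            i += 2
            while tokens[i] != ">":
                # Each attribute is assumed to span exactly five tokens:
                # name, "=", '"', value, '"'.
                attrs[tokens[i]] = tokens[i + 3]
                i += 5
            out.append((f"<{name}>", attrs))
            i += 1  # skip ">"
        elif tok == "</":
            # Closing tag: "</", name, ">".
            out.append(f"</{tokens[i + 1]}>")
            i += 3
        else:
            # Plain text passes through unchanged.
            out.append(tok)
            i += 1
    return out
```

Malformed input (for example an unquoted attribute value) would break the fixed five-token assumption; a real implementation would need to validate each step.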

Then the parser converts that list of tokens into a tree or graph that represents the source text in a form that is more convenient for the program to access and manipulate:

("<html>", {}, [
    ("<head>", {}, [
        ("<title>", {}, ["My HTML Page"]),
    ]), 
    ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ]),
])
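One way to get from the flat list of tag-level tokens to such a tree is to keep a stack of currently open elements. The following is a minimal sketch, assuming opening tags are `(name, attrs)` tuples, closing tags are strings starting with `</`, and everything else is text; `build_tree` and the `<root>` wrapper are names invented for this example.

```python
def build_tree(tokens):
    """Turn a flat list of tag-level tokens into nested (tag, attrs, children)."""
    root = ("<root>", {}, [])  # hypothetical wrapper that collects top-level nodes
    stack = [root]             # elements that are currently open
    for tok in tokens:
        if isinstance(tok, tuple):
            # Opening tag: attach a new node to the current parent and descend.
            node = (tok[0], tok[1], [])
            stack[-1][2].append(node)
            stack.append(node)
        elif tok.startswith("</"):
            # Closing tag: it must match the innermost open element.
            if len(stack) == 1 or stack[-1][0] != "<" + tok[2:]:
                raise ValueError(f"unexpected closing tag {tok}")
            stack.pop()
        else:
            # Text: becomes a child of the current element.
            stack[-1][2].append(tok)
    if len(stack) != 1:
        raise ValueError(f"unclosed element {stack[-1][0]}")
    return root[2]
```

This strict version simply rejects badly nested input; real HTML parsers instead define recovery rules for such cases.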

At this point the parsing is complete; it is then up to the user to interpret the tree, modify it, etc.

Lie Ryan
  • +1 for answering (accidentally) my long-lasting question about which fragments of text should constitute tokens in HTML / XML / SGML-based languages! (I asked about that in other threads.) Thanks, man! Very good example, indeed! :-) – SasQ Sep 09 '11 at 13:30
  • I'd only correct that, according to the W3C standard, white space shouldn't be dropped but passed to the implementation to decide (where it is ultimately dropped in most cases, except in `PRE` or similar elements, which preserve white space). – SasQ Sep 09 '11 at 20:13
  • @SasQ: Thanks, though I don't recommend my answer for people wanting to learn how HTML is actually parsed, as I wrote this answer without any knowledge of an actual HTML parser and without reading the HTML spec. My answer only illustrates the tokenizing process, and if it accidentally matches how an actual HTML parser works, then that would be an awesome coincidence. – Lie Ryan Sep 09 '11 at 22:00
  • This is a phenomenal, helpful answer. Thank you! – Rob L Jul 20 '14 at 00:38
32

First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.

Getting to your direct question: tokenization is roughly equivalent to taking English and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Words are mostly surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise, for HTML (or whatever) you specify some rules about what can make up a token (word) in this language. The piece of code that breaks the input up into tokens is normally known as the lexer.

At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
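The pull-based arrangement described above maps naturally onto a Python generator. The sketch below uses a toy language of integers and `+` rather than HTML, and the names `lex`, `parse_sum`, and the `log` list (there only to show when tokens are produced) are assumptions made for the illustration.

```python
import re

# One token: optional leading whitespace, then a number or a single operator.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def lex(text, log):
    """Yield tokens one at a time; nothing is scanned until it is asked for."""
    pos = 0
    while True:
        match = TOKEN_RE.match(text, pos)
        if match is None:
            return  # end of input (or an unknown character): stop quietly
        pos = match.end()
        log.append(match.group(1))  # record that this token was produced
        yield match.group(1)

def parse_sum(tokens):
    """A tiny 'parser' for NUMBER (+ NUMBER)* that pulls tokens as needed."""
    total = int(next(tokens))
    for op in tokens:
        if op != "+":
            raise ValueError(f"expected '+', got {op!r}")
        total += int(next(tokens))
    return total

log = []
tokens = lex("1 + 2 + 3", log)
first = next(tokens)  # only now does the lexer look at the input
```

After the single `next(tokens)` call, `log` holds just the first token; the rest of the input has not been tokenized yet.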

In a general way, you're right about how a parser works. However, while a typical parser does use a stack during the act of parsing a statement, what it builds to represent that statement is normally a tree (an Abstract Syntax Tree, aka AST), not a multidimensional array.

Given the complexity of parsing HTML, I'd hold off on looking at a parser for it until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions that are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.).
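As an example of the kind of small parser meant here, the following is a sketch of a recursive-descent parser for arithmetic expressions. It is one common textbook approach, not the only one; the call stack plays the role of the parser's stack, and the result is a tree of nested tuples. All names (`tokenize`, `parse`, `expr`, `term`, `factor`) are chosen for this illustration.

```python
import re

def tokenize(text):
    """Lexer: split the input into numbers, operators and parentheses."""
    return re.findall(r"\d+|[-+*/()]", text)

def parse(tokens):
    """Parser: build an AST of nested (operator, left, right) tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # expr := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():  # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():  # factor := NUMBER | "(" expr ")"
        tok = peek()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is not None and tok.isdigit():
            return int(advance())
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

For instance, `parse(tokenize("1 + 2 * (3 - 4)"))` produces a tree in which the multiplication sits below the addition, reflecting operator precedence.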

Jerry Coffin
10

Don't miss the W3C's notes on parsing HTML5.

For an interesting introduction to scanning/lexing, search the web for "Efficient Generation of Table-Driven Scanners". It shows how scanning is ultimately driven by automata theory. A collection of regular expressions is transformed into a single NFA. The NFA is then transformed to a DFA to make state transitions deterministic. The paper then describes a method to transform the DFA into a transition table.

A key point: scanners use regular expression theory but likely don't use existing regular expression libraries. For better performance, state transitions are coded as giant case statements or in transition tables.
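To make the transition-table idea concrete, here is a small hand-built sketch. The DFA below was written by hand for two token kinds (names and numbers) rather than generated from regular expressions, and the state names, `char_class`, and `scan` are all assumptions of this example.

```python
# States of a tiny hand-written DFA.
START, IN_NAME, IN_NUMBER = range(3)

def char_class(ch):
    """Collapse characters into the few classes the table cares about."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table: (state, character class) -> next state.
# A missing entry means "no transition", i.e. the current token ends here.
TRANSITIONS = {
    (START, "letter"): IN_NAME,
    (START, "digit"): IN_NUMBER,
    (IN_NAME, "letter"): IN_NAME,
    (IN_NAME, "digit"): IN_NAME,
    (IN_NUMBER, "digit"): IN_NUMBER,
}

# Accepting states and the token kind each one produces.
ACCEPTING = {IN_NAME: "Name", IN_NUMBER: "Number"}

def scan(text):
    """Run the DFA repeatedly, taking the longest match each time."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state, start = START, pos
        while pos < len(text):
            nxt = TRANSITIONS.get((state, char_class(text[pos])))
            if nxt is None:
                break
            state = nxt
            pos += 1
        if state not in ACCEPTING:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((ACCEPTING[state], text[start:pos]))
    return tokens
```

A generated scanner would typically use a dense array indexed by state and character class instead of a dictionary, but the control loop has the same shape.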

Scanners guarantee that correct words (tokens) are used. Parsers guarantee the words are used in the correct combination and order. Scanners use regular expression and automata theory. Parsers use grammar theory, especially context-free grammars.

A couple parsing resources:

Andy Mercer
Corbin March
  • +1 thanks for the W3C link. It looks like an informative (and long) read! – alex Jul 01 '10 at 01:53
  • And, if your syntax won't change in the future, you can "bake" your transition table right into the source code and compile it once and for all. It's possible because the machine you're running your program on is actually a state automaton too! So you can "implement your automaton in the hardware". Here's how: the state can be represented by the position in the code (the instruction pointer in the CPU). State transitions are just (un)conditional jumps (branches). You can also use the program's stack for storing/restoring the state (procedure calls & returns). This will speed things up a lot. – SasQ Sep 09 '11 at 13:36
7

HTML and XML syntax (and others based on SGML) are quite hard to parse and don't fit well into the lexing scenario, because they're not regular. In parsing theory, a regular grammar is one which doesn't have any nested recursion, that is, self-similar, nested patterns, or parentheses-like wrappers which have to match each other. But HTML/XML/SGML-based languages do have nested patterns: tags can be nested. Syntax with nesting patterns sits higher in the Chomsky hierarchy: it's context-free or even context-sensitive.
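The practical consequence can be sketched as follows: a regular expression is enough to recognize one individual tag, but checking that tags nest properly needs extra memory, here a stack. This is a simplified illustration that assumes no void or self-closing elements, no comments, and no `<` or `>` inside attribute values; `TAG_RE` and `is_well_nested` are hypothetical names.

```python
import re

# Regular part: the shape of a single opening or closing tag.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

def is_well_nested(markup):
    """Non-regular part: matching each closing tag to its opener."""
    stack = []
    for match in TAG_RE.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)  # remember the open element
        elif not stack or stack.pop() != name:
            return False        # closes something that is not innermost
    return not stack            # everything opened must have been closed
```

The stack can grow without bound as nesting deepens, which is exactly what a finite automaton cannot model.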

But back to your question about the lexer:
Each syntax consists of two kinds of symbols: non-terminal symbols (those which unwind into other syntax rules) and terminal symbols (those which are "atomic": they are leaves of the syntax tree and don't unwind into anything else). Terminal symbols are often just the tokens. Tokens are pumped one by one from the lexer and matched to their corresponding terminal symbols.

Those terminal symbols (tokens) often have regular syntax, which is easier to recognize (and that's why it's factored out to the lexer, which is more specialized for regular grammars and can do it quicker than the more general approach used for non-regular grammars).

So, to write a lexer for an HTML/XML/SGML-like language, you need to find parts of the syntax which are atomic enough and regular, to be dealt with easily by the lexer. And here the problem arises, because it's not obvious at first which parts these are. I struggled with this problem for a long time.

But Lie Ryan above has done a very good job of recognizing these parts. Bravo to him for that! The token types are as follows:

  • TagOpener: < lexeme, used for starting tags.
  • TagCloser: > lexeme, used for ending tags.
  • ClosingTagMarker: / lexeme used in closing tags.
  • Name: alphanumeric sequence starting with a letter, used for tag names and attribute names.
  • Value: Text which can contain a variety of different characters, spaces etc. Used for values of attributes.
  • Equals: = lexeme, used for separating attribute names from their values.
  • Quote: ' lexeme, used for enclosing attribute values.
  • DoubleQuote: " lexeme, used for enclosing attribute values.
  • PlainText: Any text not containing < character directly and not covered by the above types.

You can also have some tokens for entity references, like &nbsp; or &amp;. Probably:

  • EntityReference: a lexeme consisting of & followed by some alphanumeric characters and ended with ;.

Why did I use separate tokens for ' and " and not one token for the attribute value? Because regular syntax couldn't recognize which of these characters should end the sequence: it depends on the character which started it (the ending character has to match the starting character). This "parenthesizing" is considered non-regular syntax. So I promote it to a higher level, to the parser. It'd be its job to match these tokens (starting and ending) together (or none at all, for simple attribute values not containing spaces).
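A rough sketch of how the parser side of this might look, assuming the lexer hands over `(kind, lexeme)` pairs using the token names listed above. The function `parse_attribute_value` and its return convention (the value plus the index of the next unread token) are inventions for this illustration.

```python
def parse_attribute_value(tokens, pos):
    """Read one attribute value starting at tokens[pos].

    Returns (value, next_pos). The closing quote must be of the same
    kind as the opening one; the other quote kind is ordinary content.
    """
    kind, lexeme = tokens[pos]
    if kind not in ("Quote", "DoubleQuote"):
        # Unquoted value: a single token.
        return lexeme, pos + 1
    opener = kind
    parts = []
    pos += 1
    while pos < len(tokens):
        kind, lexeme = tokens[pos]
        if kind == opener:
            return "".join(parts), pos + 1
        parts.append(lexeme)  # includes the *other* quote character, if any
        pos += 1
    raise SyntaxError("unterminated attribute value")
```

With this split, a value such as `"it's"` arrives as DoubleQuote, Value, Quote, Value, DoubleQuote, and the parser reassembles it because only the DoubleQuote can close what a DoubleQuote opened.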

Afterthought: Unfortunately, some of these tokens may occur only inside other markup. So the use of lexical contexts is needed, which after all is another state machine controlling the state machines recognizing particular tokens. And that's why I've said that SGML-like languages don't fit well into the schema of lexical analysis.
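The "lexical contexts" idea can be sketched as a lexer with modes: each mode has its own set of token patterns, and certain tokens switch the mode. The example below is a deliberately small assumption-laden version (two modes, no attribute values, no entities); `MODES`, `SWITCH`, and `lex` are hypothetical names, and the token kinds follow the list above.

```python
import re

# Each lexical context (mode) has its own ordered list of token patterns.
MODES = {
    "text": [
        ("TagOpener", re.compile(r"<")),
        ("PlainText", re.compile(r"[^<]+")),
    ],
    "tag": [
        ("TagCloser", re.compile(r">")),
        ("ClosingTagMarker", re.compile(r"/")),
        ("Equals", re.compile(r"=")),
        ("Name", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
        ("Space", re.compile(r"\s+")),
    ],
}

# The outer state machine: which tokens move the lexer to another mode.
SWITCH = {"TagOpener": "tag", "TagCloser": "text"}

def lex(source):
    mode, pos, out = "text", 0, []
    while pos < len(source):
        for kind, pattern in MODES[mode]:
            match = pattern.match(source, pos)
            if match:
                break
        else:
            raise SyntaxError(f"unexpected {source[pos]!r} in {mode} mode")
        if kind != "Space":  # whitespace inside tags is not significant here
            out.append((kind, match.group()))
        pos = match.end()
        mode = SWITCH.get(kind, mode)
    return out
```

Note how the same character is treated differently depending on context: in `lex("<b>a = b</b>")` the `=` ends up inside a PlainText token, whereas inside a tag it would be an Equals token.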

t3chb0t
SasQ
3

This is how HTML 5 Parser works:

Tunaki
Vivek Kumar