In what scopes do special HTML characters need to be escaped?

Question

In HTML,

<a HREF="http://...... & .... ">Dust & Bones</a>

needs to be escaped as follows:

<a href="http://...... &amp; .... ">Dust &amp; Bones</a>

What is the scope in which &amp; needs to be applied? Is it just href, or anywhere within HTML text? What about

<input value="http://... & ">?

or within

<script>... & ... </script>

do these need escaping?

update

The bigger question, which would explain this, is: when does the HTML parser look for &XXX; tokens and replace them? Is it done once on the whole document, or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB? Different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?

@alfasin — `href="foo.php?a=b&copy=bar"`. Now that's a copyright symbol and not an ampersand followed by `copy;`. — Quentin, Jan 24 '14 at 22:28
Go and read this - http://en.wikipedia.org/wiki/Percent-encoding — admdrew, Jan 24 '14 at 22:28
@Quentin not that I understand why would you want to pass as a key the copyright symbol, but anyways, that's *not* the example he gave. — Nir Alfasi, Jan 24 '14 at 22:30
@admdrew -- it only describes the encoding, not where is it applied. — user48956, Jan 24 '14 at 22:31
@Quentin and user48956 - Yup, I know, it's just a useful read. — admdrew, Jan 24 '14 at 22:32
@alfasin — The point is that if what you said was true, then it wouldn't be a copyright symbol, and the question is asking when an `&` needs to be represented as `&amp;`, not if it needs to be represented that way in that particular example. — Quentin, Jan 24 '14 at 22:32
possible duplicate of [Do I really need to encode '&' as '&amp;'?](http://stackoverflow.com/questions/3493405/do-i-really-need-to-encode-as-amp) — Niels B., Jan 24 '14 at 22:33
@Quentin maybe you're right and it's a better approach (to encode anything that is not inside a script), I usually prefer to *not* encode unless I have to (in case I pass a URL as a parameter for example). — Nir Alfasi, Jan 24 '14 at 22:35
@alfasin — It's really easier to just keep track of whether a URL is expressed as text or as HTML or as whatever else, and encode or decode as needed, than it is to try to remember that you need to encode a URL containing `&copy` as `&amp;copy` but not one that contains `&=` as `&amp;=` (even though you *may*). (Which I think is true, but I won't swear to, as I haven't found a clear description of the exceptions from a content author's perspective, and reverse engineering one from the parser rules in the HTML 5 specification is more effort than I'm willing to go to.) — Quentin, Jan 24 '14 at 22:39
@moderator The question is not answered here: http://stackoverflow.com/questions/3493405/do-i-really-need-to-encode-as-amp i) Most of the answers describe what needs to be done within the text between tags, not within the tag definitions themselves. ii) It's completely unclear whether HTML parsers first look for instances of &..; and apply transformations to the whole document first. I suspect this is not how it works. Aren't script and cdata scopes handled differently? — user48956, Jan 24 '14 at 22:52

Quentin · Accepted Answer · 2014-01-25T08:42:07.857

The rules vary depending on the version of HTML you are dealing with, but are always more complex than is worth trying to remember.

The safe approach is "Use character references to represent the 5 HTML special characters everywhere except inside script and style elements", which makes you safe for everything except XHTML.
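As a rough illustration of that rule, here is a minimal sketch using Python's standard `html.escape`, which replaces exactly those five characters when `quote=True`. The helper name `escape_for_html` and the `example.com` URL are made up for this example and are not part of the question.

```python
import html

def escape_for_html(text: str) -> str:
    """Replace &, <, >, " and ' with character references."""
    return html.escape(text, quote=True)

# Hypothetical values standing in for the elided URL in the question.
url = "http://example.com/?a=1&b=2"
title = "Dust & Bones"

# The same escaping is applied to the attribute value and to the text content.
link = f'<a href="{escape_for_html(url)}">{escape_for_html(title)}</a>'
print(link)
# prints: <a href="http://example.com/?a=1&amp;b=2">Dust &amp; Bones</a>
```

Content destined for a script or style element would not go through this helper.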

For XHTML the rule is the same with the additional proviso of "and use explicit CDATA sections in script and style elements".
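The XHTML proviso can be seen with any XML parser, since XHTML served as XML follows XML well-formedness rules. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for an XHTML parser (an assumption made for illustration; the script body is invented).

```python
import xml.etree.ElementTree as ET

# Script body wrapped in an explicit CDATA section: & and < are taken literally.
with_cdata = "<script><![CDATA[if (a && b < c) { run(); }]]></script>"
# The same body without CDATA: a bare & or < is not well-formed XML.
without_cdata = "<script>if (a && b < c) { run(); }</script>"

script_text = ET.fromstring(with_cdata).text

try:
    ET.fromstring(without_cdata)
    bare_is_well_formed = True
except ET.ParseError:
    bare_is_well_formed = False

print(script_text)           # if (a && b < c) { run(); }
print(bare_is_well_formed)   # False
```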

> The bigger question, which would explain this, is, when does the HTML parser look for &XXX; tokens and replace them?

As it parses the HTML, depending on the current state of the tokeniser ("inside start tag" and "inside attribute value" are examples of different states).

> Is it done once on the whole document

Yes, unless you trigger additional HTML parsing (e.g. by setting innerHTML on an element).
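One consequence is that each parsing pass decodes one level of character references. A small sketch, using Python's `html.unescape` as a stand-in for a single pass (an assumption; setting innerHTML in a browser runs the full HTML parser, not just reference decoding):

```python
import html

# Text that was escaped twice, e.g. because it will be parsed twice.
stored = "&amp;lt;b&amp;gt;"

first_pass = html.unescape(stored)       # one level decoded
second_pass = html.unescape(first_pass)  # a second parse decodes again

print(first_pass)   # &lt;b&gt;
print(second_pass)  # <b>
```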

> or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB

Different rules apply in different places. The complete, current rules are (as I suggested in a comment) rather complex and would require a lot of work to extract from the HTML 5 parsing rules. This is why I suggest, if you are an HTML author and not a browser author, using the simpler rules of "Use character references unless you are in a script or style element".

> -- different parsing rules seem to apply within <script>, so I may write && (for AND) and < for (LESS-THAN). So, what rules apply in which scopes?

In HTML 4 terms, script and style elements are defined as containing CDATA (where the only sequence of characters with special meaning in HTML is </, which terminates the CDATA section). Everywhere else in the document (including, counter-intuitively, attribute values that are defined as containing CDATA) & indicates the start of a character reference (although there might be a few exceptions based on what the character following the & is).

The HTML 5 rules are more complicated, but the basic principle of "It is safe and sane to use character references for &, <, >, " and ' everywhere except inside script and style elements" holds.
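The scope difference can be observed with Python's standard `html.parser.HTMLParser`. It is not a browser-grade HTML 5 parser, but it makes the same broad distinction: character references are decoded in attribute values and ordinary text, while script content is passed through untouched. The `ScopeCollector` class and the sample markup are made up for this sketch.

```python
from html.parser import HTMLParser

class ScopeCollector(HTMLParser):
    """Record attribute values and text content per element name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open elements
        self.attrs = {}   # tag -> dict of attribute values as the parser saw them
        self.text = {}    # tag -> accumulated text content

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.attrs[tag] = dict(attrs)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            tag = self.stack[-1]
            self.text[tag] = self.text.get(tag, "") + data

markup = (
    '<a href="x?a=1&amp;b=2">Dust &amp; Bones</a>'
    "<script>if (a &amp;&amp; b) {}</script>"
)

parser = ScopeCollector()
parser.feed(markup)
parser.close()

print(parser.attrs["a"]["href"])  # x?a=1&b=2      (decoded in the attribute value)
print(parser.text["a"])           # Dust & Bones   (decoded in text content)
print(parser.text["script"])      # if (a &amp;&amp; b) {}   (left as written)
```

The last line shows why escaping inside a script element is harmful: the script would receive the literal text `&amp;&amp;` rather than `&&`.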
