218

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.

http://validator.w3.org is giving me this:

& did not start a character reference. (& probably should have been escaped as &amp;.)

Do I really need to do &amp;?

I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.

Richard J. Ross III
Haroldo
  • 64
    The specs do not say so. The poster refers to HTML5 which does not require escaping of the ampersand in all scenarios. – Matthew Wilson Aug 16 '10 at 13:39
  • 2
    This should be Community Wiki, as you're looking for opinions, and not being fussy about validation implies that there's no objective basis upon which to answer. – Richard JP Le Guen Aug 16 '10 at 14:06
  • 8
    @Richard: really? While I don't agree that "validation doesn't matter", I see this as a very objective question: "does this break anything other than the spec?" – Joachim Sauer Aug 16 '10 at 14:11
  • @Joachim Sauer - Your example is a good question... that's not what the question is though :P The exact words "I'm curious to hear people's opinions" even appear in the text! – Richard JP Le Guen Aug 16 '10 at 14:16
  • 2
    @Richard: I disagree here. "Do I really need to do `&amp;`?" and "[...] I'm curious to hear people's opinions on this and **if it's important and why**." (emphasis mine). Those two indicate that he's interested in factual information, but knows that much of this is open to at least some interpretation, so he asks for multiple opinions. – Joachim Sauer Aug 16 '10 at 14:18
  • @Joachim Sauer - This is true. I acknowledge the validity of your opinion... but stand by my own as well ;) – Richard JP Le Guen Aug 16 '10 at 14:25
  • 2
    @YiJiang **Current web browsers go to great lengths to *understand* the user**. **And so does Google**. It's part of the Spec. Future web-browsers *may* be less forgiving. So it's always a good idea to check how Wikipedia does it, and copy them. – unixman83 Feb 11 '12 at 10:50
  • When transforming XML to HTML with XSLT, it will not escape & as &amp; in attribute values. – jontro Jun 07 '12 at 11:51
  • @unixman83 That is a good approach: see how wikipedia does it – Kzqai Oct 09 '13 at 20:56
  • Google itself uses `&` in href urls. View source on http://www.google.com/ or https://plus.google.com/ I tend to like to follow the example of major players on these questionable subjects – User Mar 13 '14 at 18:54
  • Here's the [w3 spec](http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2) – rnevius May 07 '14 at 08:57
  • **Reserved characters in HTML must be replaced with character entities.** Test Example on this [URL](http://www.w3schools.com/html/html_entities.asp): `var element = document.evaluate('//table[@class="w3-table-all notranslate"]/tbody/tr[5]/td', window.document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null ).singleNodeValue; console.log('HTML:', element.innerHTML); var JS = (element.innerHTML).replace('&amp;', '&'); console.log(JS);` – Yash Feb 10 '16 at 05:43
  • 2
    The HTML spec says to accept crap input. Does that mean your site is "allowed" to be crap now? Close tags that need to be closed and escape things! Come on people. – doug65536 Aug 21 '16 at 09:17
  • I personally escape `&`, if assigned via JavaScript `element.innerHTML = '&'` or assigned to HTML directly, but it's not going to cause HTML to be parsed incorrectly. What causes a problem is quotes and `>` and `' is okay too!"`, however you would want to do ``. You don't have to self close that or do the `&`. `.innerHTML` should be escaped like raw HTML. With JavaScript `element.value =` there is no need. – StackSlave Dec 06 '19 at 00:35
  • Related post - [What is &amp used for](https://stackoverflow.com/q/9084237/465053) – RBT Jan 22 '20 at 06:36

15 Answers

151

Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and everything would be fine.

HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
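To get a feel for what "looks like a valid character reference" means, here is a minimal sketch using Python's standard `html.unescape`, which applies the HTML5 named-reference table (including the legacy names that are recognized without a trailing semicolon). The helper name `decode_as_text` is made up for illustration, and this models text content only; as far as I understand the spec, attribute values are treated more leniently when the name is followed by `=`.

```python
import html


def decode_as_text(markup: str) -> str:
    """Decode character references roughly as HTML5 text content is decoded.

    Hypothetical helper for illustration; it only wraps html.unescape.
    """
    return html.unescape(markup)


# An isolated "&" and "&x" are left alone; "&copy" is a known legacy name,
# so it is decoded even without the semicolon.
for raw in ["Dolce & Gabbana", "a&x=2", "a&copy=2", "a&amp;copy=2"]:
    print(repr(raw), "->", repr(decode_as_text(raw)))
```

Only the third sample changes meaning (it turns into the copyright sign followed by `=2`), and the escaped fourth sample decodes back to the literal text `a&copy=2`.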

Keep this point in mind: if you're not escaping & to &amp;, that's bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters. That is a huge problem for user-submitted data, and could very well lead to HTML and script injection, cookie stealing and other exploits.

Please just escape your code. It will save you a lot of trouble in the future.

Delan Azabani
  • 11
    No browser will ever "misinterpret" a & by itself. Every existing browser displays it as "&". Considering he explicitly asked for practical reason to do it, and that he stated that he doesn't care about validation.. – Thomas Bonini Aug 16 '10 at 13:13
  • 51
    Yes. But morally, should we be *relying* on the leniency and "nice" error handling of browsers? Or should we just write correct code? – Delan Azabani Aug 16 '10 at 13:15
  • 8
    @Delan: while I try to make every page I write validate, I understand from reading his question that he doesn't care about "morally". He just cares if it works or not. They are two different philosophies and both have their pros and cons, and there is not a "correct" one. For example this website doesn't validate, and yet it's a great website. – Thomas Bonini Aug 16 '10 at 13:16
  • Also, even if it was XHTML it wouldn't "break the parsing" unless the content type was set to application/xhtml+xml, which no one does because it's dumb that instead of gracefully handling an error the browser must quit. (That's why XHTML is being discontinued in favor of HTML 5) – Thomas Bonini Aug 16 '10 at 13:18
  • 3
    @Andreas, but browsers have enough bugs in how they interpret correct code, depending on them getting the right results when you send them meaningless markup is chancy. It may work today with that example, and then fail with the next example (say if the next example has a semi-colon somewhere after the &) – Jon Hanna Aug 16 '10 at 13:20
  • @Jon: I agree that it's in all cases **better** if your pages validate. I'm obviously not contesting that. The gray area is this: is it worth spending X hours of development time to make them validate, or is it better to take the slight risk that in the future, somehow, things may break? I personally think it's worth it, but I don't blame people who think it's not (such as Jeff Atwood) since it's such a gray area. One thing is certain: making pages validate costs money, and it's something important to consider. – Thomas Bonini Aug 16 '10 at 13:22
  • 1
    In this case, you are wrong. It doesn't take X hours or Y dollars to make it validate *for this particular case*. It's a simple case of `preg_replace('/&/','&amp;',$code);` – Delan Azabani Aug 16 '10 at 13:24
  • 13
    Everyone seems to be talking about HTML4, but the original question states that HTML5 is in use. HTML5 explicitly allows an unescaped & in this situation, unless what follows & would normally expand to an entity (eg &copy=2 is problematic but &x=2 is fine). – Matthew Wilson Aug 16 '10 at 13:26
  • @Andreas Bonini: You’re wrong. At least Firefox and Opera follow the rules and will interpret the following correctly: `foo&sect=bar`. – Gumbo Aug 16 '10 at 13:26
  • 1
    Until you've spent the X hours of development time making them validate (X should really be < 1 in most cases) then you don't know why they aren't validating. If you've been paying even reasonable attention to the code in the meantime, then why do you suddenly have nonsense output? You're going to have to investigate to make sure you don't have a serious bug, and then it's 5secs to fix it anyway. One of the big advantages of keeping things valid is that things suddenly being invalid can rapidly flag a subtle bug that would be missed if everything output was gibberish. – Jon Hanna Aug 16 '10 at 13:27
  • 1
    Making pages validate doesn’t really cost any money at all—at least not if you’re creating new ones. Maintaining invalid ones if things break costs money. – igor Aug 16 '10 at 13:27
  • Gosh-darn it. I missed the HTML 5 bit in the question! – Jon Hanna Aug 16 '10 at 13:28
  • @Gumbo: I explicitly said **a & by itself**. In your example it's not by itself is it? – Thomas Bonini Aug 16 '10 at 13:58
  • 1
    @Delan You say that HTML5 allow it unless *it looks like* a valid character reference. What do you mean by *looks like* exactly? Surely the standard is more precise than this. – Alex Jasmin Aug 16 '10 at 23:52
  • `&copy=3` 'looks' like a valid entity as `&copy;` is defined. According to HTML5, this kind of thing *definitely* should be escaped. `&asldfj=4` does not look like a defined reference, so it doesn't *need* to be, but should be escaped anyway for reasons I've stated above in my answer. – Delan Azabani Aug 17 '10 at 00:24
60

Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.

Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.

Compare the following: which is easier? which is easier to bugger up?

Methodology 1

  1. Write some content which includes ampersand characters.
  2. Encode them all.

Methodology 2

(with a grain of salt, please ;) )

  1. Write some content which includes ampersand characters.
  2. On a case-by-case basis, look at each ampersand. Determine if:
    • It is isolated, and as such unambiguously an ampersand. eg. volt & amp
       > In that case don't bother encoding it.
    • It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. eg amp&volt
       > In that case don't bother encoding it.
    • It is not isolated, and ambiguous. eg. volt&amp
       > Encode it.

??
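Methodology 1 fits in one line of code. Here is a rough sketch in Python (the helper name `encode_all` is mine, not from any library), using the standard `html.escape` on the three example strings above:

```python
import html


def encode_all(text: str) -> str:
    """Methodology 1: encode every ampersand, no case-by-case decisions.

    quote=False means only &, < and > are replaced, which is enough for
    element content such as <title> or <p>.
    """
    return html.escape(text, quote=False)


for sample in ["volt & amp", "amp&volt", "volt&amp"]:
    encoded = encode_all(sample)
    # Decoding the encoded form always gives back the original text.
    print(sample, "->", encoded, "->", html.unescape(encoded))
```

Every sample round-trips unchanged, including the ambiguous `volt&amp`, which is the whole point of not having to think about it.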

Richard JP Le Guen
  • 3
    The second case of `amp&volt` *is* ambiguous: Is `&volt` now an entity reference or not? – Gumbo Aug 16 '10 at 14:40
  • 7
    @Gumbo The ampersand in `amp&volt` is *not* an ambiguous ampersand (as per the definition in the HTML spec). See http://mathiasbynens.be/notes/ambiguous-ampersands and http://mothereff.in/ampersands#amp%26volt. – Mathias Bynens Jan 09 '12 at 12:58
  • @MathiasBynens By now (2019), the [definition of an ambiguous ampersand](https://html.spec.whatwg.org/multipage/syntax.html#syntax-ambiguous-ampersand) seems to have changed a bit from the definition you quoted back in 2011 in https://mathiasbynens.be/notes/ambiguous-ampersands . – Jacob C. Dec 17 '19 at 20:53
24

HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "&copy=2" is still a problem, for example, since &copy; is the copyright symbol.

However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.

Matthew Wilson
  • 2
    It’s like quoting attribute values — you don’t have to, but you can’t go wrong if you do it all the time. – Paul D. Waite Aug 23 '10 at 23:10
  • 3
    `&copy=2` is not as big of a problem as you may think. In attribute values (e.g. the `href` attribute), the `&copy` won’t be considered as a character reference for `©`. Outside an attribute value, it would. – Mathias Bynens Sep 30 '13 at 10:51
  • Given that an ampersand is normally preceded and followed by a space in English text, it's not difficult to remember or think about the rule I follow: If the ampersand is not touching another visible character, which is almost always, then it doesn't need encoding. Otherwise, just encode for simplicity's sake. – Carl Smith Apr 03 '17 at 22:41
  • Could you add a reference to the HTML5 rules? – Ferrybig Jun 12 '18 at 06:57
18

I think this has turned into more of a question of "why follow the spec when browsers don't care." Here is my generalized answer:

Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.

Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.

Ryan Kinal
  • 1
    With that being said, generally speaking, you must remember that most of the "standard" ways are still in draft mode and may change in the future. – refaelio Jun 26 '14 at 12:00
7

Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as do i really need to encode ‘&’ as ‘&’?

If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.

Thomas Bonini
7

Could you show us what your title actually is? When I submit

<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>am i allowed loose & mpersands?</p>
</body>
</html>

to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...

AakashM
  • 2
    Yes, HTML5 has a different parser than previous HTML and XHTML parsers, and allows unescaped ampersands in certain situations. – kevinji Apr 15 '11 at 19:12
  • As far as these examples go, this is nothing new in HTML5. Both `<title>Dolce & Gabbana</title>` and `<p>Dolce & Gabbana</p>` are valid HTML 2.0. – Mathias Bynens Jan 09 '12 at 14:10
6

In HTML, a & marks the beginning of a reference, either a character reference or an entity reference. From that point on the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.

But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain & can be omitted:

<p title="&amp;">foo &amp; bar</p>
<p title="&amp">foo &amp bar</p>
<p title="&">foo & bar</p>

Only in these cases the ending ; or even the reference itself can be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
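As a quick check of the three variants above, here is a small sketch with Python's `html.unescape`, which follows the HTML5 decoding rules for text. Treat it as an approximation of what a browser does, not as a validator: as far as I can tell, HTML5 parsers still decode `&amp` without the semicolon but flag the missing semicolon as a parse error.

```python
import html

# The text content of the three <p> elements above.
variants = ["foo &amp; bar", "foo &amp bar", "foo & bar"]

# All three decode to the same visible text.
decoded = [html.unescape(v) for v in variants]
print(decoded)
```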

But the specification recommends to always use a reference like the character reference &#38; or the entity reference &amp; to avoid confusion:

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Gumbo
  • 1
    That's the HTML 4 spec you link to; from my reading of the (draft) HTML 5 spec, only *ambiguous* ampersands are disallowed. An ampersand followed by a space, for example, isn't ambiguous, and so (again by my reading) should be permitted - see my answer for markup that the HTML 5 validator accepts. – AakashM Aug 16 '10 at 14:29
  • 1
    @AakashM: I’m not sure, it sounded like that. – Gumbo Aug 16 '10 at 15:39
4

Update (March 2020): The W3C validator no longer complains about unescaped ampersands in URLs.

I was checking why image URLs need escaping, so I tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed, since URLs need &. Can anyone clarify?]

<img alt="" src="foo?bar=qut&qux=fop" />

An entity reference was found in the document, but there is no reference by that name defined. Often this is caused by misspelling the reference name, unencoded ampersands, or by leaving off the trailing semicolon (;). The most common cause of this error is unencoded ampersands in URLs as described by the WDG in "Ampersands in URLs". Entity references start with an ampersand (&) and end with a semicolon (;). If you want to use a literal ampersand in your document you must encode it as "&amp;" (even inside URLs!). Be careful to end entity references with a semicolon or your entity reference may get interpreted in connection with the following text. Also keep in mind that named entity references are case-sensitive; &Aelig; and &aelig; are different characters. If this error appears in some markup generated by PHP's session handling code, this article has explanations and solutions to your problem.
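On the "will it be unescaped when it's consumed?" question: yes, the HTML parser decodes `&amp;` in the attribute value before the URL is used. A minimal sketch with Python's standard `html.parser` shows the round trip; the `SrcCollector` class and `src_values` function are names I made up for the example.

```python
import html
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collects the decoded value of every src attribute it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrive with character references already decoded.
        for name, value in attrs:
            if name == "src":
                self.sources.append(value)


def src_values(markup: str) -> list:
    parser = SrcCollector()
    parser.feed(markup)
    return parser.sources


url = "foo?bar=qut&qux=fop"
# Escape the URL for use inside a double-quoted attribute.
markup = '<img alt="" src="{}" />'.format(html.escape(url, quote=True))
print(markup)
print(src_values(markup))
```

The markup contains `&amp;`, but the value the parser hands back is the original URL with a plain `&`, so the server still receives `bar` and `qux` as separate parameters.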

tronman
Nishant
  • 1
    Read the top-voted answer. Attributes are #PCDATA and therefore parsed. Entities are handled there. In your example, the `&` starts an entity reference. After reading `&qux`, the parser finds no final semicolon (`;`), but runs into an equals sign (`=`), which cannot be a part of entity name. This should be parse error, if the parser tried to be really strict (according to HTML 4). In HTML 5, entities parsing is overall more relaxed. – Palec Apr 20 '16 at 10:56
  • 1
    I suspect that in general it is best to use `;` as a separator in query strings (when you control the link) for that reason. – Demi Aug 31 '16 at 19:32
4

It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.

For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.

For your own static html, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.

yoniLavi
Douglas
4

If the user passes it to you, or it will wind up in a URL, you need to escape it.

If it appears in static text on a page? All browsers will get this one right either way, so you don't need to worry much about it, since it will work.

Dean J
3

Yes, you should try to serve valid code if possible.

Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.

Some examples where browsers are likely to react differently are if you put elements inside a table but outside the table cells, or if you nest links inside each other.

For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.

So, you should correct errors like this in the code, if for no other reason than to keep the error list in the validator short, so that you can spot more serious problems.

Guffa
3

A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like

<div style="..." ... style="...">

When faced with a repeated style attribute, IE combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to

<div style="...; ..." ...>

and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)

dan04
2

If & is used in HTML, then you should escape it.

If & is used in JavaScript strings, e.g. alert('This & that'); or document.href, you don't need to escape it.

If you're using document.write, then you should escape it, e.g. document.write('<p>this &amp; that</p>')

Alex
  • `document.write` should be avoided. See the warning box in http://www.w3.org/html/wg/drafts/html/master/dom.html#document.write%28%29 – Oriol Apr 07 '13 at 22:55
  • Good point about `document.write()`. But the over all point Alex is making about writing to the document from script stands, imo. +1 – Patrick M Aug 19 '13 at 17:32
1

If you're really talking about the static text

<title>Foo & Bar</title>

stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.

However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):

If you don't escape a simple &, then chances are you also don't escape a &amp; or a &nbsp; or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are susceptible to XSS attacks.

In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
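A rough sketch of what "already escaping the other cases" looks like, in Python for brevity. The `render_title` helper is hypothetical and stands in for whatever templating layer generates the page:

```python
import html


def render_title(untrusted_text: str) -> str:
    """Build a <title> element from text that came from some other source.

    Hypothetical helper: html.escape with quote=False replaces &, < and >,
    which covers the standalone "&" and the tag delimiters in one pass.
    """
    return "<title>{}</title>".format(html.escape(untrusted_text, quote=False))


print(render_title("Dolce & Gabbana"))
print(render_title('&nbsp; <script src="http://attacker.com/evil.js">'))
```

The same call that neutralizes the `<script>` tag also turns the harmless `&` into `&amp;`, so leaving the standalone ampersand unescaped would take extra effort rather than less.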

Joachim Sauer
  • 2
    I didn't downvote but, if I had to guess, I'd say you were downvoted because your answer (while intelligent) is a little bit of a mismatch with the question. He's not asking about escaping user input. He has control over the characters and is basically asking "If it does what I want, is it really important to follow the language spec to the letter?" I.e., he knows that there's a & because he put it in. – Matt Aug 16 '10 at 14:59
  • @Matt: I see, and that would be reasonable. I was just assuming that no one writes entirely static HTML pages any more and that pretty much all content is at least somewhat dynamic (usually based on some database content). Maybe that assumption should have been made explicit. – Joachim Sauer Aug 16 '10 at 15:15
0

The link has a fairly good example of when and why you may need to escape & to &amp;

https://jsfiddle.net/vh2h7usk/1/

Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in &amp; and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)

mathin