4

Is there any security risk in escaping other special characters but leaving ampersands untouched when displaying user-generated/submitted information? I'd like to let my user input html entities, hex, and decimal special characters freely without adding unnecessary complexity to my sanitizer.

jeremiahs

2 Answers

5

tl;dr: Leaving ampersands (or other "special characters") in place is not a security issue if the code that uses them is written correctly. That is, what matters is the output/use, not the input.

It all depends on how the data is ultimately used. For instance, <input value="<? echo $input ?>" /> is not correctly coded for arbitrary input, because the value is written into the attribute without any escaping.

Now an & is often much less of a "problem" than some other characters (say ', ", < or >), but it could cause some artifacts (including errors and undefined behavior) in some situations, or perhaps be used for adding an extra query parameter to a URL

  • .. but if the URL is not encoded as appropriate when output, then it's not correctly coded 1
  • .. and of course if a & is written verbatim into an XML/HTML stream, then it's not correctly coded 2
  • .. and if the program is passing in raw & [from user input] to a "shell string-execute" then it's [very likely] not correctly coded 3
  • .. it all comes down to use.

I tend not to alter the input, except to make it conform to business rules, and those do not include the case mentioned above! (But it may be a perfectly valid business rule to not accept an ampersand at all.)

Proper escaping (or, better yet, approaches that don't require [manual] escaping) at the appropriate times takes care of the rest and ensures that, through good coding of the usage, trivial attacks or accidental blunders are mitigated.

In fact, I would argue that this sort of "input sanitization" shows a lack of trust in the approaches/code used elsewhere and can lead to more problems when the "sanitization" later has to be undone. Magic quotes, anyone?


1 This is a case where an & in the user input can actually cause a form of injection. Imagine format("http://site/view={0}", user_input), where user_input contains 1&buy=1. The result will be "http://site/view=1&buy=1". The correct method is to URI-encode (a.k.a. percent-encode) the value, which results in "http://site/view=1%26buy%3D1". (Note that there is only one query parameter in the correctly coded case. If the intent is to allow "raw" input to be passed through, then carefully define/analyze the permissible rules and see the following paragraph.)
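Since the asker mentions Python, here is a minimal sketch of that footnote using the standard library's urllib.parse.quote. The helper names naive_url and encoded_url are made up for illustration, and http://site/ is just the placeholder host from the example above.

```python
from urllib.parse import quote


def naive_url(user_input):
    # Hypothetical helper: pastes the raw value into the URL (the broken case).
    return "http://site/view={0}".format(user_input)


def encoded_url(user_input):
    # Hypothetical helper: percent-encodes the value first.
    # safe="" means no reserved character (not even "/") is left unencoded.
    return "http://site/view={0}".format(quote(user_input, safe=""))


print(naive_url("1&buy=1"))    # http://site/view=1&buy=1
print(encoded_url("1&buy=1"))  # http://site/view=1%26buy%3D1
```

The encoded form keeps the whole user value inside a single parameter, which is the point of the footnote.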

2 While a "bare" & can be valid in an HTML stream, user input should not be relied upon to be valid HTML. That is, regardless of whether the target is XML or HTML, the correct output/rendering escaping mechanism should be used. (The escaping mechanism might choose not to encode "bare" &'s, but that is a secondary concern. The lazy programmer will continue to use the same escaping techniques for all applicable output to get consistent, reliable, and safe output.)
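In Python this kind of output escaping is available in the standard library as html.escape (added in Python 3.2). A rough sketch, assuming a hand-built template; render_comment and the markup it emits are hypothetical, and a real application would more likely rely on its template engine's auto-escaping:

```python
import html


def render_comment(user_text):
    # Hypothetical template: the same escaped value is used both as text
    # content and inside a double-quoted attribute.
    # quote=True also escapes " and ', which matters inside attributes.
    safe = html.escape(user_text, quote=True)
    return '<p title="{0}">{0}</p>'.format(safe)


print(render_comment('Tom & "Jerry" <script>'))
# <p title="Tom &amp; &quot;Jerry&quot; &lt;script&gt;">Tom &amp; &quot;Jerry&quot; &lt;script&gt;</p>
```

The & is escaped along with everything else here, which is exactly the "same technique for all output" approach described above.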

3 Instead of using a shell-execute that takes a single string of arguments that must be parsed, use an exec form that takes a list of arguments. The latter [generally] avoids spawning a shell and the associated shell hacks. And, of course, never let the user manually specify the executable.
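As an illustration of the list form in Python, the sketch below uses subprocess.run (Python 3.5+) without shell=True. The child process is just the current interpreter echoing its first argument back, chosen so the example is self-contained; echo_argument is a hypothetical name.

```python
import subprocess
import sys


def echo_argument(user_input):
    # List form: each element is passed to the child as one argument.
    # No shell is involved, so "&", ";", "|" etc. have no special meaning.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# The ampersand arrives in the child as plain data, not as a command separator.
print(echo_argument("a & echo injected"))  # a & echo injected
```

Note that the executable itself is fixed by the program; only the argument comes from the user.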

  • 2
    +1 for stressing that user input should be escaped right before outputting it, every time, and that's it. – Judge Mental Jun 14 '12 at 06:23
  • Think a CMS scenario: I'm pulling an unclean post from the database and then outputting it into a webpage. What shenanigans could happen if I don't encode &s into &amp;s? – jeremiahs Jun 15 '12 at 03:49
  • @jeremiahs They **should** be encoded according to the context when they **are used** (e.g. output). However, `&` is "relatively" harmless, but the issue remains: either proper techniques are being used (in which case `&` is no different than ` –  Jun 15 '12 at 03:52
  • What specific security issues could that create? – jeremiahs Jun 15 '12 at 03:54
  • @jeremiahs The only one *I* can think of [now] relating to HTML is with links and being able to append additional query attributes (it won't allow injecting a script element, for example). However, it could still cause *incorrect display* of data while not being a "security issue" per se. See Gumbo's answer for entity encoding in HTML and some of the nuances. –  Jun 15 '12 at 03:54
  • The relevant page is being coded in python 3.2. I don't believe there's a built-in way to encode/unencode html entities in python like there is in php (is there)? I'd like to avoid a large-ish external library like beautifulsoup if being lazy and allowing unencoded &s when displaying user content would allow users to enter html entities if they wanted to without causing security issues. – jeremiahs Jun 15 '12 at 03:58
  • I am sure that Python web programmers found ways of cleanly dealing with this sort of issue years ago. Consider asking (*after* searching for): "How can I safely output user text content (or whatever is appropriate) to an HTML page"? There is *nothing* new here. It's a *solved* problem. (Also, isn't beautifulsoup used for *consuming* webpages?) –  Jun 15 '12 at 03:59
  • Once again, **it doesn't matter if it's an `&` or ` –  Jun 15 '12 at 04:01
  • As far as I can tell, the Pythonic thing to do is escape everything, and force users to use actual string characters (copying and pasting em-dashes from some source, or using an arcane OS control key+numbers) if they want a symbol. I like HTML shorthand references to entities and want to allow users to use them wherever they like. I remain unable to find any method for encoding/decoding entities in this manner in the standard library, nor can I locate a small library dedicated just to this sort of problem. – jeremiahs Jun 16 '12 at 17:50
  • @jeremiahs A vital concept is missing here... which framework/stack is being used? I take it the Python program in question is some form of web app? –  Jun 17 '12 at 02:28
  • Cherrypy. But I'd be interested in an answer from any python framework's perspective, like Django or Pylons. – jeremiahs Jun 19 '12 at 15:38
  • Can't find anything either, eh? – jeremiahs Jun 23 '12 at 22:45
5

It all depends on the context the data is put into.

In HTML, the main reason to represent a plain & by a character reference is to avoid ambiguity, as the & is also the beginning of such a character reference. A popular example of such ambiguity is a plain & as part of a URL parameter in an HTML attribute like this:

<a href="/?lang=en&sect=foobar">

Here the & is not encoded with a corresponding character reference, so the parser treats it as the beginning of a character reference. And since sect is a known entity in HTML, representing the section sign §, this attribute value is actually interpreted as /?lang=en§=foobar.

So leaving a plain & as it is does not pose an actual threat the way other special characters in HTML do, since those can change the context the data is put into:
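The ambiguity can be reproduced with Python's html module, whose html.unescape (Python 3.4+) follows the HTML5 rules for character references in text. This is only a sketch of the decoding step, not a model of a browser's attribute parser (which, as the comments below discuss, special-cases a following = inside attributes).

```python
import html

href = "/?lang=en&sect=foobar"

# Decoded as-is, "&sect" is taken for the legacy entity (no ";" required).
misread = html.unescape(href)

# Escaping before output turns the & into &amp;, which decodes back cleanly.
escaped = html.escape(href)
roundtrip = html.unescape(escaped)

print(misread)    # /?lang=en§=foobar
print(escaped)    # /?lang=en&amp;sect=foobar
print(roundtrip)  # /?lang=en&sect=foobar
```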

  • the tag delimiters < and > can start or end a tag declaration,
  • the attribute value delimiters " and ' can start or end an attribute value declaration.

To be on the safe side, you should use htmlspecialchars with the double_encode parameter set to false to avoid a double encoding of already existing character references:

var_dump(htmlspecialchars('<"&amp;\'>', ENT_QUOTES, 'UTF-8', false) === '&lt;&quot;&amp;&#039;&gt;'); // bool(true)
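Python's html.escape has no double_encode switch, but a roughly similar effect can be sketched by decoding existing character references first and then escaping once. This assumes Python 3.4+ for html.unescape; escape_once is a hypothetical helper name, and unlike the PHP call it replaces references such as &mdash; with the literal character rather than preserving them as typed.

```python
import html


def escape_once(text):
    # Decode any character references the user typed (&amp;, &#8212;, &mdash; ...),
    # then escape the markup-significant characters exactly once.
    return html.escape(html.unescape(text), quote=True)


print(escape_once('<"&amp;\'>'))   # &lt;&quot;&amp;&#x27;&gt;
print(escape_once("Tom & Jerry"))  # Tom &amp; Jerry
print(escape_once("a &mdash; b"))  # a — b
```

A user-typed &lt;script&gt; is decoded to <script> and then escaped again, so it still cannot open a tag.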
Gumbo
  • Odd. Firefox parses `&sect[non-word]` differently from `&sect[word]` .. I was sure the `&sect;` form was *required*. –  Jun 14 '12 at 18:39
  • Yeah, odd behavior. It doesn't work as advertised in the post (in FF), however: `click me` Only `&sect;` and `&sect-` are encoded in this case. I *do not* believe `&sect-` should be encoded though... –  Jun 14 '12 at 18:39
  • 1
    @pst The ending `;` is optional, although recommended to avoid ambiguity. SGML, the language HTML is based on, allows [declaring arbitrary entities](http://stackoverflow.com/a/3488399/53114) within the document type declaration. So it would be possible to have an entity named *sect* and one named *section*; what would `&section` reference, *sect* or *section*? Thus the `;` should be used to make the reference unambiguous. Unfortunately, today’s browsers don’t support this SGML feature as their HTML parsers are quite different from a proper SGML parser. – Gumbo Jun 14 '12 at 19:33
  • @pst In your example you reference the entity *sect2* once and *sect* three times, as an [entity’s name](http://www.is-thought.co.uk/book/sgml-6.htm#General) can only consist of alphanumeric characters and must start with an alphabetic character. Any character that does not fulfill these criteria ends the entity name. – Gumbo Jun 14 '12 at 19:46
  • What about `&sect=`? Per this answer I was expecting that to behave like `&sect;` and `&sect-` but it does not (in FF or Safari, I get `http://f/1&sect2§3§-4;&sect=5` as the results)... perhaps a browser feature to prevent query strings from breaking? Thanks for the clarification about the `;`, that makes sense. –  Jun 14 '12 at 19:58
  • Also, even with proper encoding `&amp;sect=` would be "read" as `&sect=` and thus no better than (an undecoded) `&sect=`? –  Jun 14 '12 at 20:01
  • 2
    @pst This exact behavior is actually what is specified by [HTML 5](http://www.w3.org/TR/html5/tokenization.html#consume-a-character-reference): “If the character reference is being consumed as part of an attribute, and the last character matched is not a "`;`" (U+003B) character, and the next character is either a "`=`" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (`&`) must be unconsumed, and nothing is returned.” – Gumbo Jun 14 '12 at 20:25
  • @pst And, yes, the resulting parsed data is the same — at least in this case. But it can differ in other cases. And that’s why you should always encode the value appropriately to the context it is to be used in. – Gumbo Jun 14 '12 at 20:31
  • Thanks for the clarifications! –  Jun 14 '12 at 21:08