
I get an exception if an XElement's content includes characters such as '\x1A', '\x1B', '\x1C', '\x1D', '\x1E' or '\x1F'.

using System;
using System.Collections.Generic;
using System.Xml.Linq;

namespace LINQtoXMLInvalidChars
{
    class Program
    {
        private static readonly IReadOnlyCollection<char> InvalidCharactersInXml = new List<char>
        {
            '<',
            '>',
            '&',
            '\'',
            '\"',
            '\x1A',
            '\x1B',
            '\x1C',
            '\x1D',
            '\x1E',
            '\x1F'
        };

        static void Main()
        {
            foreach (var c in InvalidCharactersInXml)
            {
                var xEl = new XElement("tag", "Character: " + c);
                var xDoc = new XDocument(new XDeclaration("1.0", "utf-8", null), xEl);

                try
                {
                    Console.Write("Writing " + c + ": ");
                    Console.WriteLine(xDoc);
                }
                catch (Exception e)
                {
                    Console.WriteLine("Oops.    " + e.Message);
                }
            }

            Console.ReadKey();
        }
    }
}

In an answer from Jon Skeet to the question String escape into XML, I read:

You set the text in a node, and it will automatically escape anything it needs to.

So now I'm confused. Do I misunderstand something?

Some background information: the string content of the XElement comes from the end user. I see two options for making my application robust: 1) Base64-encode the string before passing it to the XElement, or 2) narrow the accepted set of characters to, e.g., alphanumeric characters.
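
For option 2, something along these lines is what I have in mind (just a sketch; char.IsLetterOrDigit accepts any Unicode letter or digit, not only ASCII):

using System.Linq;

static class InputPolicy
{
    // Option 2: accept only a narrow whitelist of characters.
    public static bool IsAccepted(string input) =>
        !string.IsNullOrEmpty(input) && input.All(char.IsLetterOrDigit);
}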

Gyula Kósa

1 Answer


Most of those characters simply aren't valid in XML 1.0 at all. Personally, I wish LINQ to XML would refuse to produce a document that it would later be unable to parse, but basically you should avoid them.
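
To make the distinction concrete, here's a quick sketch: '<' merely needs escaping, and LINQ to XML does that for you, whereas U+001A is not a legal XML 1.0 character at all, so the tree can be constructed but not serialized.

using System;
using System.Xml.Linq;

class EscapingVersusInvalid
{
    static void Main()
    {
        // '<' is valid in XML; LINQ to XML escapes it when writing the text node.
        var escaped = new XElement("tag", "a < b");
        Console.WriteLine(escaped);      // <tag>a &lt; b</tag>

        // U+001A is outside the XML 1.0 Char production, so serialization throws.
        var invalid = new XElement("tag", "\u001A");
        try
        {
            Console.WriteLine(invalid);  // the exception is thrown while serializing
        }
        catch (Exception e)
        {
            Console.WriteLine("Rejected: " + e.Message);
        }
    }
}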

I would also recommend avoiding \x as an escape sequence anyway, preferring \u - the fact that \x will take "up to" 4 hex digits can be very confusing.
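
For example, "\x1Bad" is a single character (U+1BAD) rather than ESC followed by "ad", which is rarely what was intended:

using System;

class EscapeSequences
{
    static void Main()
    {
        // \x consumes up to four hex digits, so this is one character, U+1BAD...
        string surprising = "\x1Bad";
        // ...whereas \u always takes exactly four, so this is U+001B, 'a', 'd'.
        string intended = "\u001Bad";

        Console.WriteLine(surprising.Length); // 1
        Console.WriteLine(intended.Length);   // 3
    }
}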

From the XML 1.0 spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
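
That production translates almost directly into code; just a sketch, and for a single char, XmlConvert.IsXmlChar performs essentially the same check:

static class Xml10
{
    // A direct translation of the Char production above, in terms of code points.
    public static bool IsValidXmlChar(int codePoint) =>
        codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
        || (codePoint >= 0x20 && codePoint <= 0xD7FF)
        || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
        || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}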

Now U+000D and U+000A are interesting cases - they won't be escaped in text nodes; they'll just be included verbatim. Whether or not that's then present when you parse the node will depend on parse settings (and whether there are non-whitespace characters around it).

In terms of how to handle this in your case, you definitely have the following options:

  • Performing your own encoding/escaping. This is generally somewhat painful, and will lead to XML documents which are hard to read compared with regular ones. You could potentially do this only when required, adding an attribute to the element to say that you've done it, for example.
  • Detecting and removing characters which are invalid in XML.
  • Detecting and rejecting strings containing characters which are invalid in XML.

We can't really tell which of these is most appropriate in your scenario.
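
A minimal sketch of the last two options, using XmlConvert (the element name "tag" is just a placeholder):

using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class UserInputXml
{
    // Detect and reject: VerifyXmlChars throws an XmlException if the string
    // contains any character that is invalid in XML 1.0.
    public static XElement CreateOrThrow(string userInput) =>
        new XElement("tag", XmlConvert.VerifyXmlChars(userInput));

    // Detect and remove: drop the invalid characters instead. Note that checking
    // code units one at a time also drops surrogate pairs (characters outside
    // the BMP), which may or may not be acceptable.
    public static XElement CreateStripped(string userInput) =>
        new XElement("tag", new string(userInput.Where(XmlConvert.IsXmlChar).ToArray()));
}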

Jon Skeet
  • Jon, many thanks! I added some background information as a comment to the original question. I would be interested in your opinion on how application developers usually approach this situation. – Gyula Kósa Dec 17 '15 at 10:24
  • @GyulaKósa: The background information should be in the question itself, not a comment. I'll amend my answer appropriately. – Jon Skeet Dec 17 '15 at 10:26
  • I would like my serialization technology chosen (XML in this case) to be _transparent_ to the user of my component, so I would like to avoid _rejecting_ inputs just because they contain characters that are invalid in XML. _Removing_ characters is not an option to me unfortunately. There remains performing my _own encoding/escaping_ that really seems to be painful. Currently I am using [this](http://stackoverflow.com/a/10380166/1986541) approach to Base64-encode the byte representation of my string, but I have reservations because of the potential problem of endianness, for example. – Gyula Kósa Dec 18 '15 at 17:24
  • 1
    @GyulaKósa: That's a terrible approach to encoding characters as bytes. You should at least use UTF-8 or something similar. Is the use of XML entirely enforced on you? Because if you're not using it as a textual representation really, it's a terribly inefficient approach... – Jon Skeet Dec 18 '15 at 20:44
  • Unfortunately the use of XML is enforced on me, but it does not really matter if the string is written out as is or encoded. [XmlConvert.EncodeName(String)](https://msdn.microsoft.com/en-us/library/system.xml.xmlconvert.encodename%28v=vs.110%29.aspx) is almost what I need, but it was meant for the _name_ of the element, not for the _content_. I get the impression that XML limits its use, and it is not really meant for storing plain vanilla .NET `string` objects. – Gyula Kósa Dec 19 '15 at 08:39
  • I did consider using [UTF-8](https://msdn.microsoft.com/en-us/library/system.text.encoding.utf8%28v=vs.110%29.aspx), but it also has limitations, e.g. being unable to encode `\uD802`. – Gyula Kósa Dec 19 '15 at 08:49
  • 1
    @GyulaKósa: No, it can encode U+D802 just fine - as part of a surrogate pair. If you have a string including U+D802 without a corresponding low surrogate, that's simply not a well-formed Unicode string. You really need to think about what you're trying to support. There's a big difference between serializing "any well-formed Unicode string" and "any sequence of UTF-16 code units". – Jon Skeet Dec 19 '15 at 09:48
  • I think I wrongly assumed that ["a series of Unicode characters"](https://msdn.microsoft.com/en-us/library/system.string%28v=vs.110%29.aspx) i.e. a `String` object is always a well-formed Unicode string. – Gyula Kósa Dec 21 '15 at 10:59
  • Let's assume that the string is a well-formed Unicode string. Do you think that it's a good approach to encode it using `System.Text.Encoding.UTF8.GetBytes` and then `System.Convert.ToBase64String` **if** `System.Xml.XmlConvert.VerifyXmlChars` throws `XmlException`? Thanks. – Gyula Kósa Dec 21 '15 at 12:00
  • 1
    @Gyula: That MSDN documentation is unfortunate. A string is actually a sequence of UTF-16 code units. But yes, that approach should be okay - add an attribute to the element to indicate that it's base64. – Jon Skeet Dec 21 '15 at 13:43
  • Readers of this conversation may find the [When is a string not a string?](http://codeblog.jonskeet.uk/2014/11/07/when-is-a-string-not-a-string/comment-page-1/) blog post useful. – Gyula Kósa Dec 29 '15 at 16:16
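
For reference, a minimal sketch of the approach the comments converge on: verify the string with XmlConvert.VerifyXmlChars, and fall back to Base64 of its UTF-8 bytes when it contains invalid characters, marking that case with an attribute. The attribute name "encoding" is just an example, and the sketch assumes well-formed Unicode input as discussed above.

using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class SafeStringElement
{
    public static XElement Create(XName name, string value)
    {
        try
        {
            // Valid XML content: store the string as-is.
            XmlConvert.VerifyXmlChars(value);
            return new XElement(name, value);
        }
        catch (XmlException)
        {
            // Invalid content: store Base64 of the UTF-8 bytes and mark the element.
            string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
            return new XElement(name, new XAttribute("encoding", "base64"), base64);
        }
    }

    public static string Read(XElement element) =>
        (string)element.Attribute("encoding") == "base64"
            ? Encoding.UTF8.GetString(Convert.FromBase64String(element.Value))
            : element.Value;
}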