I'm requesting an XML document over the web. XDocument.Load(stream) throws an exception because the XML contains a bare &, so the parser expects a terminating ; as in &amp;.

I read the stream into a string and replaced & with &amp;, but that broke all the other, correctly encoded special characters such as &oslash;.

Is there a simple way to encode all disallowed chars in the string before parsing to XDocument?

Tomalak
espvar
  • How do you get the XML? What does it look like? – L.B Nov 30 '12 at 17:07
  • It cannot contain `&oslash;`, because this character entity is not defined in XML. You are probably trying to read an HTML file. HTML and XML are not compatible. If you *are* in fact trying to read HTML, you should go with the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/). – Tomalak Nov 30 '12 at 17:09
  • @L.B I use a C# System.Net WebRequest to a URL. The XML contains lots of data and breaks because it contains & in clear text, not &amp;, and not wrapped in CDATA blocks. E.g.: My text & his textThis is a special char &aelig; – espvar Nov 30 '12 at 17:12
  • @Tomalak I tried to fix it by doing a string Replace("&", "&amp;"). That broke all special chars like &aelig; &oslash; &aring;. I just assumed that it contained HTML-encoded values for these special chars. – espvar Nov 30 '12 at 17:16
  • @espvar Again: There *are* no special characters `&oslash;` in XML. They are defined for HTML, not for XML. Even if the source would encode the `&` correctly, the resulting document would still be broken and unparsable. If you cannot fix the source, the HTML Agility Pack is the way to go. It contains a forgiving parser that will be able to produce a document object from broken input. – Tomalak Nov 30 '12 at 17:19
  • @espvar, I have an idea, but before that: are you fine if your XML data has `<![CDATA[...]]>`? – InfantPro'Aravind' Nov 30 '12 at 17:24
  • Example: your data `some data &oslash;` would look like `<![CDATA[some data &oslash;]]>` – InfantPro'Aravind' Nov 30 '12 at 17:25
  • The source does not wrap problem chars with <![CDATA[ ]]>. – espvar Nov 30 '12 at 17:28
  • @espvar! No!! Source need not do that !! You can do it now!! By simple algorithm! just let me know if it is fine! – InfantPro'Aravind' Nov 30 '12 at 17:29
  • step1: By using simple algorithm (not really simple to be honest), by using .net code you can wrap the text that has `&`!! like I showed above! no need to encode! just string replacement will do! – InfantPro'Aravind' Nov 30 '12 at 17:31
  • step2: once it is encoded with CDATA, you can load it as an XML!! you can pass it or store wherever you wish – InfantPro'Aravind' Nov 30 '12 at 17:31
  • step3: while extracting data from XML: extract the CDATA sections separately !! YOU ARE DONE!! – InfantPro'Aravind' Nov 30 '12 at 17:31
  • Want to give it a shot? If you are fine with this process then I will try, otherwise I don't want to waste my effort! I understand that your problem here is loading the string as XML. Moreover, your XML has invalid encoded characters like `&oslash;`. This kind of data must be treated as CDATA, no other way! – InfantPro'Aravind' Nov 30 '12 at 17:34
  • I'm fine with the CDATA wrap :) – espvar Nov 30 '12 at 17:37

3 Answers

Try CDATA sections in XML.

A CDATA section can only be used in places where you could have a text node.

<foo><![CDATA[Here is some data including <, > or & etc.]]></foo>
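
When you control the code that writes the XML, a CDATA section does not have to be assembled by hand. The following is only a minimal sketch using LINQ to XML; the class name `CDataWriter` and the method `BuildFoo` are hypothetical names chosen for illustration, not something from this answer.

```csharp
using System;
using System.Xml.Linq;

public static class CDataWriter
{
    // Builds a <foo> element whose text is emitted as a CDATA section,
    // so characters like '<', '>' and '&' need no manual escaping.
    public static string BuildFoo(string text)
    {
        XElement foo = new XElement("foo", new XCData(text));
        return foo.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildFoo("Here is some data including <, > or &"));
    }
}
```

This does not help with XML that is already broken when it arrives, which is the situation in the question; it only shows the writing side.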
Anujith

This kind of method is not encouraged. The reason lies in your question:

(replacing & by &amp; turns &gt; to &amp;gt;)

A better approach than using regex is to fix the source code that generates such unencoded XML.
I have come across (.NET) code that uses string concatenation to build XML (one should use the XML DOM instead).
If you have access to the source code, it is better to go ahead and fix it there, because re-encoding such half-encoded XML cannot be guaranteed to work perfectly.
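
If the source cannot be fixed, the double-encoding problem described above can at least be reduced by escaping only those ampersands that do not already start a valid XML reference. The sketch below is an assumption-laden illustration, not a robust repair: `AmpersandFixer` and `EscapeBareAmpersands` are hypothetical names, and the pattern treats only XML's five predefined entities and numeric character references as valid. HTML-only entities such as `&oslash;` are therefore escaped too and come back as the literal text `&oslash;`, and an `&` inside an existing CDATA section or comment would be escaped wrongly.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Matches '&' unless it starts one of XML's five predefined entities
    // or a numeric character reference such as &#248; or &#xF8;.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    public static void Main()
    {
        // Sample input invented for this sketch.
        string broken = "<root><a>My text & his text</a><b>1 &lt; 2</b><c>&oslash;</c></root>";
        XDocument doc = XDocument.Parse(EscapeBareAmpersands(broken));
        Console.WriteLine(doc.Root.Element("a").Value); // My text & his text
        Console.WriteLine(doc.Root.Element("b").Value); // 1 < 2
        Console.WriteLine(doc.Root.Element("c").Value); // &oslash;
    }
}
```

The negative lookahead is what keeps `&gt;` from turning into `&amp;gt;`: an ampersand that is already followed by a recognized reference is left untouched.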

InfantPro'Aravind'
  • Thank you for your reply. The problem is that I don't have access to the source of the XML. I have alerted the source owner to the problem, but I fear it is going to take some time, and I need a quick fix. – espvar Nov 30 '12 at 17:18

@espvar,

This is an input XML:

<root><child>nospecialchars</child><specialchild>data&data</specialchild><specialchild2>You.. & I in this beautiful world</specialchild2>data&amp;</root>

And the Main function:

        string EncodedXML = encodeWithCDATA(XMLInput); //Calling our Custom function

        XmlDocument xdDoc = new XmlDocument();

        xdDoc.LoadXml(EncodedXML); //passed

The function encodeWithCDATA():

    private string encodeWithCDATA(string stringXML)
    {
        if (stringXML.IndexOf('&') != -1)
        {
            // Index of the '>' that ends the tag just before the text containing '&'.
            int indexofClosingtag = stringXML.Substring(0, stringXML.IndexOf('&')).LastIndexOf('>');
            // Offset, relative to that '>', of the '<' that starts the next tag.
            int indexofNextOpeningtag = stringXML.Substring(indexofClosingtag).IndexOf('<');

            // Wrap only the text between the two tags; skip the '>' itself.
            string CDATAsection = string.Concat("<![CDATA[", stringXML.Substring(indexofClosingtag + 1, indexofNextOpeningtag - 1), "]]>");

            string encodedLeftPart = string.Concat(stringXML.Substring(0, indexofClosingtag+1), CDATAsection);
            string UncodedRightPart = stringXML.Substring(indexofClosingtag+indexofNextOpeningtag);
            // Recurse on the remainder, which may contain further '&' characters.
            return (string.Concat(encodedLeftPart, encodeWithCDATA(UncodedRightPart)));
        }
        else
        {
            return (stringXML);
        }
    }

Encoded XML (i.e., xdDoc.OuterXml, indented here for readability):

<root>
  <child>nospecialchars</child>
  <specialchild>
    <![CDATA[data&data]]>
  </specialchild>
  <specialchild2>
    <![CDATA[You.. & I in this beautiful world]]>
  </specialchild2>
  <![CDATA[data&amp;]]>
</root>

All I have used is Substring, IndexOf, string.Concat and a recursive function call. Let me know if you don't understand any part of the code.

The sample XML that I have provided has text directly in the parent node as well, which is typical of HTML, e.g. <div>this is <b>bold</b> text</div>. My code also takes care of wrapping text outside the <b> tag if it contains the special character &.

Please note that I have taken care of '&' only; the data must not contain characters like '<' or '>', single quotes or double quotes.
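
Reading the wrapped values back needs no special handling, since the parser returns the CDATA content as ordinary text. A minimal sketch, assuming the XmlDocument API used above; `CDataReader` and `ReadText` are hypothetical names introduced only for this illustration:

```csharp
using System;
using System.Xml;

public static class CDataReader
{
    // Loads the XML and returns the text of the first node matching the XPath,
    // or null if nothing matches. CDATA content is returned as plain text.
    public static string ReadText(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        string wrapped = "<root><specialchild><![CDATA[data&data]]></specialchild></root>";
        Console.WriteLine(ReadText(wrapped, "/root/specialchild")); // data&data
    }
}
```

One consequence worth keeping in mind: text that was already escaped correctly, such as `data&amp;` in the sample, comes back literally as `data&amp;` once it sits inside a CDATA section, because entity references are not expanded there.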

InfantPro'Aravind'