39

I am using XML to share HTML content. AFAIK, I could embed the HTML either by:

  • Encoding (escaping) it: I don't know whether this is completely safe, and I would have to decode it again.

  • Using CDATA sections: I believe I could still have problems if the content contains the closing delimiter "]]>" or certain hexadecimal characters. On the other hand, the XML parser would extract the content transparently for me.

Which option should I choose?

UPDATE: The XML will be created in Java and passed as a string to a .NET web service, where it will be parsed back. Therefore I need to be able to export the XML as a string and load it using "doc.LoadXml(xmlString);"

hjpotter92
alberto

11 Answers

38

The two options are almost exactly the same. Here are your two choices:

<html>This is &lt;b&gt;bold&lt;/b&gt;</html>

<html><![CDATA[This is <b>bold</b>]]></html>

In both cases, you have to check your string for special characters to be escaped. Lots of people pretend that CDATA strings don't need any escaping, but as you point out, you have to make sure that "]]>" doesn't slip in unescaped.

In both cases, the XML processor will return your string to you decoded.
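A quick check of that equivalence, sketched in Python with the standard library's xml.etree.ElementTree (Python is my choice for illustration here, not the Java/.NET stack from the question):

```python
import xml.etree.ElementTree as ET

# The two serializations from the answer above.
escaped = "<html>This is &lt;b&gt;bold&lt;/b&gt;</html>"
cdata = "<html><![CDATA[This is <b>bold</b>]]></html>"

# Both documents carry the same character data; only the markup differs.
text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

print(text_from_escaped)
print(text_from_cdata)
```

Once parsed, the consumer cannot tell which of the two forms the producer used.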

Ned Batchelder
  • 1
    The one reason I'd opt to not use CDATA, is that usually the majority of data doesn't require escaping, and it is a mess to see so many CDATA wrappers on text that needs no escaping. The first method means that occasionally you have HTML encodings, but the majority of the time you have nice clean text with no unnecessary wrapper. Of course this might be different depending on your typical data. – AaronLS Apr 01 '14 at 12:51
12

CDATA is easier to read by eye, while encoded content can safely contain end-of-CDATA markers. But you don't have to care: just use an XML library and stop worrying about it. Then all you have to say is "Put this text inside this element" and the library will either encode it or wrap it in CDATA markers.
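As a minimal sketch of "let the library do it", here is the idea in Python's standard xml.etree.ElementTree (an illustrative choice; the Java and .NET XML APIs offer the same kind of operation). The sample string is made up for the example:

```python
import xml.etree.ElementTree as ET

# Made-up HTML fragment containing tags, an ampersand and a "]]>" sequence.
html = "Tricky <b>bold</b> & a stray ]]> marker"

root = ET.Element("html")
root.text = html  # "put this text inside this element"

# The serializer escapes whatever needs escaping.
xml_string = ET.tostring(root, encoding="unicode")

# The parser undoes the escaping on the way back in.
round_tripped = ET.fromstring(xml_string).text

print(xml_string)
print(round_tripped == html)
```

ElementTree happens to escape rather than emit CDATA; either way the caller never handles the special characters by hand.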

Quentin
7

CDATA for simplicity.

Mohamed
3

If you use CDATA, then you must decode it correctly (textContent, value and innerHTML are properties that will NOT return the proper data).

Let us say that you use an XML structure similar to this:

<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[
            <tr><td>2013/12/05 02:00 - 2013/12/07 01:59 </td></tr><tr><td width="90">Rastreado</td><td width="60">Placa</td><td width="100">Data hora</td><td width="60" align="right">Km/h</td><td width="40">Direção</td><td width="40">Azimute</td><td>Mapa</td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:55</td><td align='right'>113</td><td align='right'>NE</td><td align='right'>40</td><td><a href="http://maps.google.com/maps?q=-22.6766,-50.2218&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6766,-50.2218</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:56</td><td align='right'>112</td><td align='right'>NE</td><td align='right'>23</td><td><a href="http://maps.google.com/maps?q=-22.6638,-50.2106&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6638,-50.2106</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 18:00</td><td align='right'>111</td><td align='right'>SE</td><td align='right'>118</td><td><a href="http://maps.google.com/maps?q=-22.7242,-50.2352&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.7242,-50.2352</a></td></tr>
            ]]>
        </content>
    </command>
</response>

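In JavaScript, you decode it by loading the XML (with jQuery, for example) into a variable such as xmlDoc below, and then reading the nodeValue of the first child of the second occurrence (item(1)) of the content tag. Note that this assumes the CDATA section is the first child node, i.e. that there is no whitespace between <content> and <![CDATA[.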

xmlDoc.getElementsByTagName("content").item(1).childNodes[0].nodeValue

or (both notations are equivalent)

xmlDoc.getElementsByTagName("content")[1].childNodes[0].nodeValue
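The same lookup can be sketched with Python's standard xml.dom.minidom, which exposes a similar DOM API (this is a translation for illustration, not the JavaScript from the answer, and the HTML payload is shortened). It selects the CDATA node by type, because indentation inside <content> produces whitespace text nodes around it:

```python
from xml.dom import minidom, Node

xml_string = """<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[<tr><td>Silverado</td></tr>]]>
        </content>
    </command>
</response>"""

doc = minidom.parseString(xml_string)
content = doc.getElementsByTagName("content")[1]

# childNodes[0] is the whitespace before the CDATA section, not the CDATA itself.
first_child = content.childNodes[0]

# Pick the CDATA section by node type instead of by position.
cdata_nodes = [n for n in content.childNodes
               if n.nodeType == Node.CDATA_SECTION_NODE]
html = cdata_nodes[0].nodeValue

print(repr(first_child.nodeValue))
print(html)
```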
tony gil
1

I don't know what XML builder you're using, but PHP (actually libxml) knows how to handle ]]> inside CDATA sections, and so should every other XML framework. So, I'd use a CDATA section.
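For reference, the usual way to carry "]]>" inside CDATA is to end the section in the middle of the sequence and immediately open a new one. A sketch in Python; wrap_in_cdata is a hypothetical helper written for this example, not a libxml or PHP function, and the payload is made up:

```python
import xml.etree.ElementTree as ET

def wrap_in_cdata(text: str) -> str:
    """Hypothetical helper: wrap text in CDATA, splitting each ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Made-up script fragment that happens to contain "]]>".
payload = "if (a[b[0]]>1) { alert('<b>hi</b>'); }"

document = "<content>" + wrap_in_cdata(payload) + "</content>"

# The parser joins the adjacent CDATA sections back into one string.
parsed = ET.fromstring(document).text

print(document)
print(parsed == payload)
```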

Ionuț G. Stan
1

It makes sense to wrap HTML in CDATA. The HTML text will probably constitute a single value in the XML.

Without the CDATA wrapper, every XML parser will read the HTML tags as part of the XML document. While it is easy to work around this when consuming the XML, why take on the extra headache?

If you actually want to parse the HTML into a DOM, then it's better to read the HTML text out first and set up a separate parser to process that text.

Hope that came out the way I intended it to.

jrharshath
1

Personally, I hate CDATA segments, so I'd use encoding instead. Of course, if you add XML to XML to XML then this would result in encoding over encoding over encoding and thus some very unreadable results. Why do I hate CDATA segments? I wish I knew. Personal preference, mostly. I just don't like getting used to adding "forbidden characters" inside a special segment where they would suddenly be allowed again. It just confuses me when I see XML mark-up within a CDATA segment and it's not part of the XML surrounding it. At least with encoding I will see that it's encoded.

Good XML libraries will handle both encoding and CDATA segments transparently. It's just my eyes that get hurt.

Wim ten Brink
0

Encoding it will work fine and is reliable. You can encode encoded sections etc. without any difficulty.

Decoding will be done automatically by whatever XML parser is used to handle your encoded HTML.
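A small sketch of why nested encoding is safe, using escape and unescape from Python's standard xml.sax.saxutils (an illustrative choice; a real XML parser performs the decoding step for you). The sample string is made up:

```python
from xml.sax.saxutils import escape, unescape

# Made-up HTML that already contains an entity of its own.
html = "This is <b>bold</b> &amp; already contains an entity"

once = escape(html)    # what goes into the XML document
twice = escape(once)   # the same document embedded in another XML document

# Each decoding step peels off exactly one layer.
print(unescape(twice) == once)
print(unescape(once) == html)
```

Because "&" is itself escaped to "&amp;", each level of encoding is unambiguous and reversible.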

Brian Agnew
0

I think the answer depends on what you are planning to do with the HTML content, and also on what type of HTML content you plan to support.

Especially when it comes to included JavaScript, encoding often results in problems. CDATA definitely helps you there.

If you plan to use only small snippets (i.e. a paragraph) and have a way to preprocess/filter them (because you don't want JavaScript or fancy things anyway), you will probably be better off with encoding or actually just putting the HTML directly as a subtree in the XML. You can then also post-process the HTML (i.e. filter style or onclick attributes). But this is definitely more work.

Niko
0

You can use a combination of both. For example, if you want to pass <h1>....</h1> in an XML node, you have to use a CDATA section to pass it. The content inside <h1>...</h1> must be encoded to HTML entities, e.g. &lt; for <. Encoding the text between tags solves the problem of ]]> being interpreted as the end of the section, because it gets converted to ]]&gt; and the HTML tags themselves do not contain ]]>.

You can do this only if you generate the HTML yourself.

Xinus
0

If your HTML is well-formed, then just embed the HTML tags without escaping or wrapping in CDATA. If at all possible, it helps to keep your content in XML. It gives you more flexibility for transforming and manipulating the document.

You could set a namespace for the HTML, so that you could disambiguate your HTML tags from the other XML wrapping it.

Escaped text means that the entire HTML block will be one big text node. Wrapping in CDATA tells the XML parser not to parse that section. It may be "easier", but limits your abilities downrange and should only be employed when appropriate; not just because it is more convenient. Escaped markup is considered harmful.
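A rough sketch of the namespaced-subtree approach in Python's standard xml.etree.ElementTree (illustrative only; it works only when the fragment is well-formed XML, and the "content" wrapper element and the "h" prefix are assumptions made for this example):

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("h", XHTML_NS)  # prefix chosen for the example

# Hypothetical wrapper element from the surrounding XML vocabulary.
wrapper = ET.Element("content")

# The HTML fragment is parsed as XML, so it must be well-formed.
fragment = ET.fromstring(f'<p xmlns="{XHTML_NS}">This is <b>bold</b></p>')
wrapper.append(fragment)

xml_string = ET.tostring(wrapper, encoding="unicode")

# After a round trip the HTML is still a queryable tree, not one text node.
bold = ET.fromstring(xml_string).find(f".//{{{XHTML_NS}}}b")

print(xml_string)
print(bold.text)
```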

Mads Hansen
  • 1
    HTML isn't necessarily valid XML (for example, HTML doesn't require closing tags). They look alike because they share SGML as their common ancestor. The only option really is to escape the data or use CDATA. Otherwise the XML parser will fail when it finds the malformed markup. – Greg Dietsche Feb 25 '14 at 21:11