6

THE PROBLEM: I need an XML file that is "fully encoded" in UTF-8: no entities standing in for symbols, with every symbol written as a real UTF-8 character, except the 3 XML-reserved ones, "&" (amp), "<" (lt) and ">" (gt). I also need a built-in function that does this fast: one that transforms entities into real UTF-8 characters (without corrupting my XML).
  PS: it is a "real-world problem" (!); PMC/journals, for example, holds 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change numeric entities into UTF-8 characters.

THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!) by also decoding the 3 XML-reserved symbols.

Illustrating the problem

Suppose

  $xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

Here the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, but the XML-reserved lt must not. After the transformation, the XML text should be,

$xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" contains an XML-reserved character, so it MUST stay as A&lt;B.


Frustrated solutions

I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 to try the ENT_XML1 option,

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
                                                        // as I expected
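
A minimal sketch of what goes wrong, using only the fragment from above (the variable $decoded is just for illustration): with ENT_XML1 the numeric entities are decoded as wanted, but &lt; is decoded too, so a raw "<" ends up inside the text.

```php
<?php
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

// Decode with the XML 1 document type, as in the attempt above.
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The numeric entities became real UTF-8 characters (wanted)...
var_dump(strpos($decoded, '∬') !== false);   // bool(true)
// ...but &lt; was decoded as well, leaving a raw "<" in the text (not wanted).
var_dump(strpos($decoded, 'A<B') !== false); // bool(true)
```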

Perhaps another question is, "WHY is there no other option that does what I expected?" -- it is important for many other XML applications (!), not only for me.


I do not need a workaround as an answer... OK, here is my ugly function; perhaps it helps you to understand the problem,

  function xml_entity_decode($s) {
    // An illustration (as a user-defined function) of how the
    // hypothetical PHP built-in function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    // Hide the 3 XML-reserved entities behind placeholders.
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);

    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+

    // Restore the reserved entities.
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }  // You see? No benchmark needed:
     //  it is not as fast as calling html_entity_decode directly; an
     //  XML-safe option would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.
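
A small sketch of why the flag matters (the sample string is made up for illustration): &euro; is in the HTML 4.01 entity table, while &lpar; and &rpar; exist only in the HTML5 table, so only ENT_HTML5 decodes all three.

```php
<?php
$s = 'f&lpar;x&rpar; = &euro;5';

// HTML 4.01 table: &euro; is known, &lpar; and &rpar; are not.
$html4 = html_entity_decode($s, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
// HTML5 table: all three are known.
$html5 = html_entity_decode($s, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');

echo $html4, "\n"; // f&lpar;x&rpar; = €5
echo $html5, "\n"; // f(x) = €5
```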

Peter Krauss
  • Your XML fragment there is already well formed XML - why are you trying to decode it? It *looks* like you're [trying to solve a different problem to the one you have](http://blogs.msdn.com/b/ericlippert/archive/2003/11/03/a-parable.aspx). – Rowland Shaw Aug 05 '13 at 11:23
  • I **need a fast built-in function**, perhaps html_entity_decode() without bugs, and I illustrated the function with a user-defined one. – Peter Krauss Aug 05 '13 at 15:06
  • `html_entity_decode` does what I'd expect it to do, given your input - hence why I think the issue is why you think you need to decode it? – Rowland Shaw Aug 05 '13 at 15:50
  • @RowlandShaw, the question is not directly about `html_entity_decode`; it is about "where is the PHP built-in function that does this?"... html_entity_decode was my guess, and I showed how frustrating it is to try to use it in that context. I edited the question (check whether the introduction is better) to emphasise the problem; sorry for my difficulty in expressing it in English. PS: perhaps there is no such built-in function, so my dream is to see PHP 5.6's html_entity_decode with an option to do this simple and important task. – Peter Krauss Aug 06 '13 at 10:08
  • 1
    So it sounds like you want the method to transform the XML to something semantically identical, but without using entities where possible? In which case, I suspect that the method isn't there, as it *shouldn't* be needed - any XML parser reading the XML should treat your two fragments exactly the same (assuming the UTF-8 encoding doesn't get mangled/misrepresented on the way) – Rowland Shaw Aug 06 '13 at 11:31
  • Yes, exactly: "to transform the XML to something semantically identical, but without using entities where possible". But about utility, see the question: I MUST save (or interchange) the file as UTF-8; it is not for an "expert tool that has its own internal DOM representation and loads anything". It is a real problem and a real limitation of PHP. – Peter Krauss Aug 06 '13 at 11:37
  • Pay attention: as [commented here](http://stackoverflow.com/a/20124022/287948), my solution `xml_entity_decode()` works fine and needs 1/10 of the time of the non-native workaround... REPEATING: the problem here is not my function, it is the **absence of a PHP built-in function/parameter that solves the problem**. – Peter Krauss Nov 10 '14 at 16:14

6 Answers

4

From time to time this question attracts a "false answer" (see the answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

... So, let's repeat my workaround (which is NOT an answer!) to avoid creating more confusion:

The best workaround

Pay attention:

  1. The function xml_entity_decode() below is the best workaround (better than any other).
  2. The function below is not an answer to the present question; it is only a workaround.

  function xml_entity_decode($s) {
    // illustrating how a (hypothetical) PHP built-in function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }
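
One case the placeholder approach does not cover is a numeric reference to a reserved character itself, such as &#38; or &#x3C;, which html_entity_decode turns into a raw & or <. A possible sketch, not benchmarked here, splits the string on every reference to & < > and decodes only the text in between. The name xml_entity_decode_strict and the regular expression are illustrative assumptions, not an existing PHP API.

```php
<?php
// Hypothetical variant: also protects numeric references to & < >.
function xml_entity_decode_strict($s)
{
    // Named, decimal and hexadecimal references that resolve to & < or >.
    $reserved = '/(&(?:amp|lt|gt|#0*(?:38|60|62)|#[xX]0*(?:26|3[cCeE]));)/';

    // With DELIM_CAPTURE the result alternates: text, reference, text, ...
    $parts = preg_split($reserved, $s, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: ordinary text between reserved references.
            $parts[$i] = html_entity_decode($part, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
        }
        // Odd index: a reserved reference, kept exactly as written.
    }
    return implode('', $parts);
}

echo xml_entity_decode_strict('<p>A&lt;B &#60; &#x3C; &#38; &amp; &#x222C;</p>'), "\n";
// <p>A&lt;B &#60; &#x3C; &#38; &amp; ∬</p>
```

Because no placeholder strings are involved, a literal #_x_amp#; in the source text is also left alone.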

To test, and to demonstrate that you have a better solution, please try it first with this simple benchmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002

    /* 0.0014
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX, 
     " seconds</h1>";
  
Peter Krauss
2

Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:

$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
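
For a string instead of a file, the same idea can be sketched with loadXML() and saveXML(). The fragment below contains only numeric entities, so no DTD is loaded; that is an assumption of this sketch, and named entities such as those in JATS would still need the DTD flags shown above.

```php
<?php
$xml = '<p>Let A&lt;B and A=&#x222C;dxdy</p>';

$doc = new DOMDocument;
$doc->loadXML($xml);
$doc->encoding = 'UTF-8';   // serialize non-ASCII characters as UTF-8, not as references
$out = $doc->saveXML();

// The numeric reference is now a real character; the reserved &lt; is still escaped.
var_dump(strpos($out, '∬') !== false);      // bool(true)
var_dump(strpos($out, 'A&lt;B') !== false); // bool(true)
```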
m13r
Alf Eaton
  • Yes, it is not a solution for my problem (I do not have the DTDs), but it is a good solution for people who are working with complete XML+DTD sets... and who do not need performance. My problem is (as stated) "I need a XML file *full encoded* by UTF8 ... and, I need a build-in function that do it fast". Do you have a benchmark comparing the performance of your load/save workaround with my `xml_entity_decode()`? – Peter Krauss Nov 21 '13 at 15:04
  • Hello... 1 year without any benchmark? OK, I did one: my **`xml_entity_decode()` needs 0.0002 seconds** to convert a big XML string. Your "`loadXML()` and `saveXML()`" needs **0.0014 seconds** to convert the same XML string. So *your solution needs ~10 times more* than `xml_entity_decode()`... So, **it is not a solution** (!). – Peter Krauss Nov 10 '14 at 16:07
  • @PeterKrauss As long as there are no unusual named entities in the XML, you can leave out the libxml flags and not load the DTD (although if it's JATS XML you're working with, you probably do want to load the DTD, even if it makes things slower). The important part is to add `$doc->encoding = 'UTF-8';` before saving the XML. Does that make the benchmark more acceptable? – Alf Eaton Nov 10 '14 at 17:07
  • @AlfEaton, please test your assertion on a simple fragment with a named character entity, such as `&nbsp;`. Without the DTD, loadXML() raises an error... So it is not a solution to the described problem. – Peter Krauss Mar 18 '16 at 13:13
  • @PeterKrauss The described problem doesn't include `&nbsp;`. If you're using named character entities, then you need to load a DTD that maps them to Unicode codepoints. – Alf Eaton Mar 21 '16 at 14:12
  • @AlfEaton, sorry, you are correct, it was only a PS comment in my description, "Must be ENT_HTML5 flag, for convert really all named entities"... Well, your solution is good but, even for this restricted scope, its performance is bad, and there is no news in PHP7... I think it is a case for submitting a [PHP RFC](https://wiki.php.net/rfc). – Peter Krauss Mar 21 '16 at 14:24
2

I had the same problem because someone used HTML templates to create XML, instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.

Note: I'm assuming default encoding is UTF-8

// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

/* <Foo>€&amp;foo Ç</Foo> */

Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>&#8364;&amp;foo &#199;</Foo> */

In your case you want it the other way around. Encode numbered entities as UTF-8:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

// Decodes any remaining (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>€&amp;foo Ç</Foo> */

Benchmark

I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.

<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>

Your method

php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237

My method

php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567

My method (with unicode to numbered entity):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244

My method (with numbered entity to unicode):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
aross
1
// Class method: relies on $this->charset and an is_php() helper defined elsewhere.
public function entity_decode($str, $charset = NULL)
{
    if (strpos($str, '&') === FALSE)
    {
        return $str;
    }

    static $_entities;

    isset($charset) OR $charset = $this->charset;
    $flag = is_php('5.4')
        ? ENT_COMPAT | ENT_HTML5
        : ENT_COMPAT;

    do
    {
        $str_compare = $str;

        // Decode standard entities, avoiding false positives
        if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
        {
            if ( ! isset($_entities))
            {
                $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));

                // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                // entities to the array manually
                if ($flag === ENT_COMPAT)
                {
                    $_entities[':'] = '&colon;';
                    $_entities['('] = '&lpar;';
                    $_entities[')'] = '&rpar;';
                    $_entities["\n"] = '&newline;';
                    $_entities["\t"] = '&tab;';
                }
            }

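            $replace = array();
            // array_unique() can shorten the list and leave gaps in its keys,
            // so iterate over the values instead of counting up to $c.
            $matches = array_unique(array_map('strtolower', $matches[0]));
            foreach ($matches as $match)
            {
                if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE)
                {
                    $replace[$match] = $char;
                }
            }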

            $str = str_ireplace(array_keys($replace), array_values($replace), $str);
        }

        // Decode numeric & UTF16 two byte entities
        $str = html_entity_decode(
            preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
            $flag,
            $charset
        );
    }
    while ($str_compare !== $str);
    return $str;
}
ganji
  • Please see and analyse my `xml_entity_decode()` function, it is OK (!). The problem is not my function, which works; the problem is PHP (where is the "native/built-in function"?). About your function: if the behaviour of your `xml_convert()` is not exactly the same, **it is wrong**: please check your function and correct it if necessary... Then say how it differs from my `xml_entity_decode()`. – Peter Krauss Nov 10 '14 at 15:33
  • 1
    Excuse me, Peter, for my oops! I have edited my last answer. It is a replacement for html_entity_decode(); I hope it is useful. It is not technically correct to leave out the semicolon at the end of an entity, but most browsers will still interpret the entity correctly. html_entity_decode() does not convert entities without semicolons, so this function adds a small fix for that, which may be helpful to review for your challenge. – ganji Nov 11 '14 at 15:37
0

For those coming here because a numeric entity in the range 128 to 159 remains a numeric entity instead of being converted to a character:

echo xml_entity_decode('&#128;');
// Outputs &#128; instead of the expected €

This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are not printable characters in Unicode (they are C1 control characters). This can happen when the data to be converted mixes in windows-1252 content (where byte 128 is the € sign).
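
If the source really is windows-1252 content, one possible sketch is to rewrite those references before decoding. The helper name decode_cp1252_numeric_refs is hypothetical, and the mbstring extension is assumed to be available. Only decimal references are handled, and how mbstring treats the few bytes that windows-1252 leaves undefined (129, 141, 143, 144, 157) is not covered here.

```php
<?php
// Hypothetical helper: treat &#128; .. &#159; as windows-1252 bytes and emit UTF-8.
function decode_cp1252_numeric_refs($s)
{
    return preg_replace_callback('/&#(1[2-5][0-9]);/', function ($m) {
        $n = (int) $m[1];
        if ($n < 128 || $n > 159) {
            return $m[0]; // outside the 128..159 range: leave untouched
        }
        // Reinterpret the number as a single windows-1252 byte.
        return mb_convert_encoding(chr($n), 'UTF-8', 'Windows-1252');
    }, $s);
}

echo decode_cp1252_numeric_refs('&#128; 5 &#127;'), "\n"; // € 5 &#127;
```

It would run before xml_entity_decode(), so that the remaining entities are decoded as usual.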

Thomas Lauria
-1

Try this function:

function xmlsafe($s, $intoQuotes = 1) {
    if ($intoQuotes) {
        return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
    } else {
        return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
    }
}

Example usage:

echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';

See also: https://stackoverflow.com/a/9446666/2312709

This code is used in production, and it seems no problems have occurred with UTF-8.
