4

I'm currently experimenting with delivering XHTML5. At the moment I deliver XHTML 1.1 Strict on the page I'm working on, at least to capable browsers. For those that don't accept XML-encoded data I fall back to HTML 4.01 Strict.

In experimenting with using HTML5 for either, when delivering as HTML5 all works more or less as expected. The first issue I have when delivering as XHTML5, however, is with the HTML entities. FF4 says &uuml; is an undefined entity, because there is no HTML5 DTD.

I read that the HTML5 wiki currently recommends:

Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;)

I do need &lt; and &gt; in certain places. Hence my question: what is the best way in PHP to decode all but the five entities named above? html_entity_decode() decodes all of them, so is there a reasonable way to exclude some?

UPDATE:

I went with a simple replace / replace back approach for the moment, so unless there really is an elegant way the question is solved enough for my immediate needs.

function non_html5_entity_decode($string)
{
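    // Hide the five predefined XML entities behind placeholders so they survive decoding.
    $string = str_replace("&amp;",'@@@AMP',
                        str_replace("&apos;",'@@@APOS',
                        str_replace("&lt;",'@@@LT',
                        str_replace("&gt;",'@@@GT',
                        str_replace("&quot;",'@@@QUOT',$string)))));
    $string = html_entity_decode($string);
    // Put the five entities back in place of their placeholders.
    $string = str_replace('@@@AMP',"&amp;",
                        str_replace('@@@APOS',"&apos;",
                        str_replace('@@@LT',"&lt;",
                        str_replace('@@@GT',"&gt;",
                        str_replace('@@@QUOT',"&quot;",$string)))));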
    return $string;
}
C.O.

2 Answers

3

Beware of "universal" conversions: html_entity_decode() with default parameters does not decode all named entities, only the few defined by the old HTML 4.01 standard. So entities like &copy; (©) will be converted, but some like &plus; (+) will not. To convert ALL named entities, pass ENT_HTML5 in the second parameter (!).

Also, if the destination encoding is not UTF-8, it cannot receive characters with code points above 255, such as &Ascr; (𝒜), which is code point 119964 > 255.

So, to convert "ALL POSSIBLE NAMED ENTITIES", you MUST use html_entity_decode($s,ENT_HTML5,'UTF-8'), but this is valid only with PHP 5.4+, where the flag ENT_HTML5 was implemented.
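To illustrate the difference, here is a minimal sketch that assumes PHP 5.4 or later and a UTF-8 source file. The input string and variable names are made up for the example.

```php
<?php
// Illustrative input: one HTML 4.01 entity and two HTML5-only entities.
$s = '&copy; 2 &plus; 2 &Ascr;';

// Default flags use the HTML 4.01 entity table, so &plus; and &Ascr; stay as they are.
$default = html_entity_decode($s);

// ENT_HTML5 switches to the HTML5 entity table; UTF-8 can hold code points above 255.
$html5 = html_entity_decode($s, ENT_HTML5, 'UTF-8');

echo $default, "\n"; // © 2 &plus; 2 &Ascr;
echo $html5, "\n";   // © 2 + 2 𝒜
```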

In the particular case of this question, you must also use the flag ENT_NOQUOTES instead of the default ENT_COMPAT, so you must use html_entity_decode($s,ENT_HTML5|ENT_NOQUOTES,'UTF-8')
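A quick check of what these flags do may help. As far as I can tell, ENT_NOQUOTES only protects &quot; and &apos;, while &amp;, &lt; and &gt; are still decoded. The sketch below shows that, and adds a hypothetical helper, decode_except_xml_entities() (my own name, not a PHP function), that keeps all five predefined entities by decoding entity by entity. It assumes PHP 5.4+ and only handles the lowercase forms of the five names.

```php
<?php
// Hypothetical helper (not from the original post): decode every named entity
// except the five predefined XML ones, without placeholder strings.
function decode_except_xml_entities($s)
{
    $keep = array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;');

    // Match named references only; numeric ones (e.g. &#60;) are valid XML and left alone.
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        function ($m) use ($keep) {
            if (in_array($m[0], $keep, true)) {
                return $m[0]; // keep the predefined entity as written
            }
            // Names that PHP does not know come back unchanged.
            return html_entity_decode($m[0], ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
        },
        $s
    );
}

$in = '&uuml; &lt;b&gt; &quot;q&quot; &amp;';

// Flags alone: the quotes survive, but &lt;, &gt; and &amp; are decoded.
$flagsOnly = html_entity_decode($in, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');

// Helper: all five predefined entities survive.
$kept = decode_except_xml_entities($in);

echo $flagsOnly, "\n"; // ü <b> &quot;q&quot; &
echo $kept, "\n";      // ü &lt;b&gt; &quot;q&quot; &amp;
```

The callback form avoids placeholder strings such as @@@AMP, which could in principle collide with real page content.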


PS (edited): thanks to @BoltClock for the reminder about the PHP 5.4+ requirement.

Peter Krauss
  • Of course, `ENT_HTML5` is only available in PHP 5.4, which wasn't even available at the time this question was first asked. If you're still on an older version of PHP, you will have to find a workaround. – BoltClock Aug 10 '13 at 03:31
  • Oops, sorry, I only discovered that now, after some trial and error... Well, I will not delete the answer, because now the page has an "explicit solution" for "fast readers" like me. Thanks @BoltClock. – Peter Krauss Aug 10 '13 at 03:39
  • That's OK - the answer may be helpful for future readers. I'm just saying that the feature is relatively new, so it may not benefit certain people. – BoltClock Aug 10 '13 at 03:40
  • The server with the project this was for is not yet on PHP 5.3. But let's look to the future. Thank you for reviving this one. – C.O. Aug 10 '13 at 17:48
0

I think a html_entity_decode() followed by a htmlspecialchars() is the easiest way to go.

It won't produce &apos; though; to get that, you'd have to run htmlspecialchars() first, and then convert ' into &apos;.
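A minimal sketch of that idea, assuming the input is plain text (for example from a database) rather than finished markup; the sample string and variable names are only illustrative:

```php
<?php
// Illustrative plain-text input with mixed entities and a literal apostrophe.
$s = 'M&uuml;ller &amp; S&ouml;hne: it\'s &lt;here&gt;';

// Step 1: decode everything to raw UTF-8 characters.
$decoded = html_entity_decode($s, ENT_QUOTES, 'UTF-8');

// Step 2: re-encode only the special characters. With ENT_QUOTES and the default
// HTML 4.01 document type, the apostrophe becomes &#039;.
$encoded = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

// Step 3: swap the numeric reference for the named XML entity.
$result = str_replace('&#039;', '&apos;', $encoded);

echo $result, "\n"; // Müller &amp; Söhne: it&apos;s &lt;here&gt;
```

Because step 2 escapes every <, > and &, this would also escape real tags if the string already contained markup.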

Pekka
  • That would not work for me. I'm using output buffers and I want to get the buffer's contents at the very end and do the replacement. If I used `htmlspecialchars()` I'd encode all of my source. I want to preserve those entities where they already exist. – C.O. Jun 17 '11 at 20:57
  • Thanks for trying to help; see above for what I eventually used to get on with testing the XHTML5 output. Your answer is sensible for data that comes from a DB etc., so I'm accepting it. – C.O. Jun 17 '11 at 21:54
  • This suggestion does not solve the problem. See `htmlspecialchars(" { &")`: OK, it will preserve `&`, but destroy `{`, which cannot be converted afterwards by html_entity_decode. – Peter Krauss Aug 10 '13 at 04:14