Processing a Byte Order Mark

Question

I've been writing code to process xml downloaded via a webservice API. I was going ok until one query had some mysterious characters before the root element.

After contacting the support, I got the following message... "The ABS.Stat APIs resultant XML output are UTF-8 compliant. These characters are a UTF-8 Byte Order Mark designed to identify the xml as UTF-8. Hope this helps."

Whilst waiting for their reply I continued with my programming by simply starting my DOM processing at the opening tag (first "<") with the following code...

Dim lgRootElementStart As Long
lgRootElementStart = InStr(1, hReq.ResponseText, "<")
Dim sgResponse As String
sgResponse = Mid(hReq.ResponseText, lgRootElementStart)

Dim xmlDoc As New MSXML2.DOMDocument
If Not xmlDoc.LoadXML(sgResponse) Then
    etc. etc. etc.

All seems to be well, the data is deciphered and displayed ok.

But now that I know what those characters are, is there anything I should do with those characters?
Or to put it another way, is there anything I can do with those characters to make my excel application more reliable? i.e. now that I know the XML is UTF-8, how should I process it differently?

What should I do if the BOM gives UTF-16?

I'm not sure what you mean here. Are you having any kind of trouble working with UTF-8 BOM? — Foxfire And Burns And Burns, Nov 25 '19 at 10:30
No trouble so far. Just nobody advised me to skip the BOM, I just did it all on my own. Just want to be sure I will have no trouble in the future too. — Mikey, Nov 25 '19 at 10:52
Possible duplicate of [What's the difference between UTF-8 and UTF-8 without BOM?](https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom) — Tom Brunberg, Nov 25 '19 at 11:01
I'm getting the feeling that excel is assuming UTF-8. So if the BOM says UTF-16 it won't work, I'm guessing I should write code to abort the processing of the xml if it contains a BOM and it's anything other than UTF-8. Is this a fair assumption? — Mikey, Nov 25 '19 at 11:26
To make sure you treat an HTTP response charset in right way, you may rely on the response [header `Content-Type:`](https://tools.ietf.org/html/rfc7231#section-3.1.1.5). In that case it should be `Content-Type: text/xml; charset=utf-8`. Then convert the response binary content from `.responseBody` to text using `ADODB.Stream` object. — omegastripes, Nov 25 '19 at 11:59

score 0 · Answer 1 · answered Nov 26 '19 at 00:59

Well it seems that the BOM is more a nuisance than helpful, but I placed code in my application to check that it is a UTF8 BOM if any characters before the xml root element are received. If it's not a UTF8 BOM then an error is thrown. I'm not expecting this to be a problem any more, but if I ever see the error then I will have to re-analyse what's going on. Hopefully that will never happen. Code is...

Public Const BOM_UTF8 As String = "ï»¿"

and

If lgRootElementStart > 1 Then
    If Left(hReq.ResponseText, lgRootElementStart - 1) = BOM_UTF8 Then
    Else
        Err.Raise ERROR_SHOULD_NEVER_HAPPEN, sFunctionName, _
            "Non UTF8 BOM found. " _
            & "BOM is ..." & ConvertToHex(Left(hReq.ResponseText, lgRootElementStart - 1)) _
            & ", correct BOM is ... " & ConvertToHex(BOM_UTF8)
    End If
End If

One quote from a link in the comments says..."Encodings should be known, not divined". Well with this code I know it's UTF8 if I get it.

Processing a Byte Order Mark

1 Answers1