Accurate JSON text encoding detection

Question

RFC 4627 described a method for identifying the Unicode encoding when no BOM is present. It relied on the first 2 characters of the JSON text always being ASCII characters. RFC 7159, however, defines a JSON text as "ws value ws", which implies that a single string value is also valid. The first character would then be the opening quote, but any Unicode character allowed in a string could follow. Given that RFC 7159 also discourages the use of a BOM and no longer describes the procedure for detecting the encoding from the first 4 octets (bytes), how should one detect it? UTF-32 should still work as described in RFC 4627, because the first character is four bytes and should still be ASCII, but what about UTF-16? The second (2-byte) character might not contain a zero byte to help identify the correct encoding.

Interesting question. Certainly, determining the Unicode scheme is more challenging. For example, a single byte whose ASCII value is a digit is also valid JSON: it's a single-digit number in UTF-8. How about proposing a solution, and we can discuss it? - CouchDeveloper, Oct 21 '15 at 13:00
I'm afraid I don't have a proposal that will be reliable. Obviously, one first checks for a BOM, just in case. Next, check for UTF-32: if the first 3 bytes are zero, it's UTF-32BE; otherwise, if the 3 bytes after the first are zero, it's UTF-32LE. So far we should be able to rely on the test, but now comes the problem. To test for UTF-16, we still have to look at 4 bytes. If we could assume that the first 2 characters are always ASCII, we would test whether the first and third bytes are zero for UTF-16BE, and the second and fourth for UTF-16LE. But if the second character can be non-ASCII, what then? - Jannie Gerber, Oct 21 '15 at 13:33
Would it help if I gave you a head start with an implementation in C++ for detecting the encoding? As far as I know you don't need the BOM detection, but I have one implemented in C++, too. - CouchDeveloper, Oct 21 '15 at 15:08
Anything that would give reliable detection would help, thanks. It can be pseudocode too; I just need the steps for what to test for. - Jannie Gerber, Oct 21 '15 at 18:58

score 3 · Accepted Answer · answered Oct 21 '15 at 15:39

After taking a look at an implementation I wrote a couple of years ago, I can say that it's possible to unambiguously detect the Unicode scheme from just the first character, given the following assumptions:

The input must be Unicode
The first character must be ASCII
There must be no BOM

Consider this:

Assume the first character is "[" (0x5B), an ASCII character. Then we may get these byte patterns:

UTF_32LE:    5B 00 00 00  
UTF_32BE:    00 00 00 5B
UTF_16LE:    5B 00 xx xx
UTF_16BE:    00 5B xx xx
UTF_8:       5B xx xx xx

where "xx" is either EOF or any other byte.
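
To make the table concrete, here is a minimal sketch that produces those byte patterns for a single ASCII character. The names `Scheme` and `encode_ascii` are hypothetical and are not part of the implementation shown in this answer; the sketch also assumes the character is below U+0080.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class Scheme { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE };

// Encode one ASCII character (code point below U+0080) in the given scheme.
// For ASCII, every scheme stores the same value byte and pads with zero
// bytes according to the code unit width and byte order.
std::vector<std::uint8_t> encode_ascii(char ch, Scheme s) {
    const std::uint8_t b = static_cast<std::uint8_t>(ch);
    switch (s) {
        case Scheme::UTF_8:    return {b};
        case Scheme::UTF_16LE: return {b, 0};
        case Scheme::UTF_16BE: return {0, b};
        case Scheme::UTF_32LE: return {b, 0, 0, 0};
        case Scheme::UTF_32BE: return {0, 0, 0, b};
    }
    return {};  // not reached for valid Scheme values
}
```

Calling `encode_ascii('[', Scheme::UTF_16BE)` yields the two bytes `00 5B`, which matches the UTF_16BE row above once the trailing "xx" bytes are ignored.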

We should also note that, according to RFC 7159, the shortest valid JSON text can be just one character. That is, it can be 1, 2 or 4 bytes long, depending on the Unicode scheme.

So, if it helps, here is an implementation in C++:

#include <cstdint>  // uint8_t, uint32_t
#include <cstdio>   // EOF

namespace json {

    //
    //  Detect Encoding
    //
    // Tries to determine the Unicode encoding of the input starting at 
    // first. A BOM shall not be present (you might check with function 
    // json::unicode::detect_bom() whether there is a BOM, in which case 
    // you don't need to call this function when a BOM is present).
    //
    // Return values:
    // 
    //   json::unicode::UNICODE_ENCODING_UTF_8
    //   json::unicode::UNICODE_ENCODING_UTF_16LE
    //   json::unicode::UNICODE_ENCODING_UTF_16BE
    //   json::unicode::UNICODE_ENCODING_UTF_32LE
    //   json::unicode::UNICODE_ENCODING_UTF_32BE
    //
    //  -1:     unexpected EOF
    //  -2:     unknown encoding
    //
    // Note:
    // detect_encoding() needs to read ahead a few bytes in order to
    // determine the encoding. With InputIterators, this means that the
    // iterators cannot be reused afterwards, for example by a parser.
    // Usually, this requires resetting the streambuf, that is, using the
    // member functions pubseekpos() or pubseekoff() to reset the get
    // pointer of the stream buffer to its initial position.
    // However, certain streambuf implementations may not be able to set the
    // stream position to arbitrary positions. In this case, this method
    // cannot be used and other educated guesses to determine the encoding
    // may be needed.

    template <typename Iterator>    
    inline int 
    detect_encoding(Iterator first, Iterator last) 
    {
        // Assuming the input is Unicode!
        // Assuming first character is ASCII!

        // The first character must be an ASCII character, say a "[" (0x5B)

        // UTF_32LE:    5B 00 00 00
        // UTF_32BE:    00 00 00 5B
        // UTF_16LE:    5B 00 xx xx
        // UTF_16BE:    00 5B xx xx
        // UTF_8:       5B xx xx xx

        uint32_t c = 0xFFFFFF00;

        while (first != last) {
            uint32_t ascii;
            if (static_cast<uint8_t>(*first) == 0)
                ascii = 0; // zero byte
            else if (static_cast<uint8_t>(*first) < 0x80)
                ascii = 0x01;  // ascii byte
            else if (*first == EOF)
                break;
            else
                ascii = 0x02; // non-ascii byte, that is a lead or trail byte
            c = c << 8 | ascii;
            switch (c) {
                    // reading first byte
                case 0xFFFF0000:  // first byte was 0
                case 0xFFFF0001:  // first byte was ASCII
                    ++first;
                    continue;
                case 0xFFFF0002:
                    return -2;  // this is bogus

                    // reading second byte
                case 0xFF000000:    // 00 00 
                    ++first;
                    continue;
                case 0xFF000001:    // 00 01
                    return json::unicode::UNICODE_ENCODING_UTF_16BE;
                case 0xFF000100:    // 01 00
                    ++first;
                    continue;
                case 0xFF000101:    // 01 01
                case 0xFF000102:    // 01 02  e.g. a string starting with a non-ASCII character
                    return json::unicode::UNICODE_ENCODING_UTF_8;

                    // reading third byte:    
                case 0x00000000:  // 00 00 00
                case 0x00010000:  // 01 00 00  
                    ++first;
                    continue;                    
                    //case 0x00000001:  // 00 00 01  bogus
                    //case 0x00000100:  // 00 01 00  na
                    //case 0x00000101:  // 00 01 01  na
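                case 0x00010001:  // 01 00 01 
                case 0x00010002:  // 01 00 02  second code unit is non-ASCII
                    return json::unicode::UNICODE_ENCODING_UTF_16LE;

                    // reading fourth byte    
                case 0x01000000:  // 01 00 00 00
                    return json::unicode::UNICODE_ENCODING_UTF_32LE;
                case 0x01000001:  // 01 00 00 01  second code unit has a zero low byte
                case 0x01000002:  // 01 00 00 02
                    return json::unicode::UNICODE_ENCODING_UTF_16LE;
                case 0x00000001:  // 00 00 00 01
                    return json::unicode::UNICODE_ENCODING_UTF_32BE;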

                default:
                    return -2;  // could not determine encoding, that is,
                                // assuming the first byte is an ASCII.
            } // switch
        }  // while 

        // premature EOF
        return -1;
    }
}

Thanks for the code. My original code followed the procedure described in RFC 4627, which looks at the first 4 bytes as follows: 00 00 00 xx UTF-32BE, 00 xx 00 xx UTF-16BE, xx 00 00 00 UTF-32LE, xx 00 xx 00 UTF-16LE, xx xx xx xx UTF-8. With RFC 7159 allowing a single string value, one could also get something like 00 xx xx 00 or xx 00 00 xx when UTF-16 is used. So I just changed mine as follows: 00 00 00 xx UTF-32BE, xx 00 00 00 UTF-32LE, 00 xx UTF-16BE, xx 00 UTF-16LE, all else UTF-8. As far as I can see, your code should work as is. - Jannie Gerber, Oct 21 '15 at 23:30
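
The revised table in the comment above can be sketched roughly as follows. This is only an illustration under the same assumptions as the answer (the input is Unicode, there is no BOM, and the first character is a non-NUL ASCII character); the names `Encoding` and `guess_encoding` are hypothetical and do not come from either poster's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE };

// Guess the encoding of a BOM-less JSON text from its first bytes.
// Assumes the first character is ASCII and not NUL.
Encoding guess_encoding(const std::uint8_t* p, std::size_t n) {
    // UTF-32 is tested first, because its patterns also match the
    // two-byte UTF-16 tests below.
    if (n >= 4) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] != 0)
            return Encoding::UTF_32BE;   // 00 00 00 xx
        if (p[0] != 0 && p[1] == 0 && p[2] == 0 && p[3] == 0)
            return Encoding::UTF_32LE;   // xx 00 00 00
    }
    // UTF-16 only looks at the first code unit, so the second
    // character may be anything, including non-ASCII.
    if (n >= 2) {
        if (p[0] == 0 && p[1] != 0)
            return Encoding::UTF_16BE;   // 00 xx
        if (p[0] != 0 && p[1] == 0)
            return Encoding::UTF_16LE;   // xx 00
    }
    return Encoding::UTF_8;              // all else
}
```

For example, the bytes `22 00 2D 4E` (an opening quote followed by U+4E2D in UTF-16LE) do not match the RFC 4627 pattern xx 00 xx 00, but this sketch still reports UTF-16LE because it only inspects the first code unit.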
