Map supplementary Unicode characters to BMP (if possible)

Question

I ran into the issue that my XML parser (VTD-XML) doesn't seem to be able to handle Unicode supplementary characters (please correct me if I'm already wrong here). It seems the parser only uses the lower 16 bits of such characters.
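To make the suspected behaviour concrete, here is a small sketch of what keeping only the lower 16 bits does to a supplementary code point. The class and method names are made up for illustration, and the masking is only my assumption about what happens inside the parser, not something taken from the VTD-XML source.

```java
public class TruncationDemo {
    /** Keeps only the lower 16 bits of a code point (the suspected truncation). */
    public static int truncateTo16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        // U+10400 is stored in a Java String as the surrogate pair D801 DC00.
        String s = "\uD801\uDC00";
        int codePoint = s.codePointAt(0);            // 0x10400
        int truncated = truncateTo16Bits(codePoint); // 0x0400
        System.out.printf("U+%04X -> U+%04X%n", codePoint, truncated);
    }
}
```

Under that assumption, U+10400 would come back as U+0400, a completely different character in the Cyrillic block.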

I cannot switch to another parser within the project I'm working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed) and it seems that documents containing supplementary characters have been added over the last year (e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, end of the results section).

As a quick and dirty fix, I would just delete all characters above 0xFFFF from the documents. Obviously, that will destroy some expressions in the document texts, so I'm not really happy with that solution.
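For reference, the quick and dirty variant could look roughly like this in plain Java (8 or later). It is a minimal sketch that runs on the decoded text before it is handed to the parser; `SupplementaryStripper` and `stripSupplementary` are hypothetical names, not part of VTD-XML.

```java
public class SupplementaryStripper {
    /** Returns the text with every code point above U+FFFF removed. */
    public static String stripSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            // keep only code points that fit into the BMP
            .filter(cp -> !Character.isSupplementaryCodePoint(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Deseret letter U+10400 between 'a' and 'b' is dropped.
        System.out.println(stripSupplementary("a\uD801\uDC00b"));
    }
}
```

Iterating over code points instead of `char` values ensures that both halves of a surrogate pair are removed together.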

Since I can't change the parser, I was wondering whether it is possible to map supplementary characters to BMP characters that are likely to have a similar-looking glyph, where such characters exist.
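One candidate for such a mapping is Unicode compatibility normalization (NFKC), which is available in the JDK as `java.text.Normalizer`. Some supplementary characters, for example the mathematical alphanumeric symbols, have compatibility decompositions to ordinary BMP letters; others, such as U+10400, have none and would stay unchanged. The sketch below only normalizes supplementary code points and leaves the rest of the text alone. The class and method names are hypothetical, and I have not checked whether this covers the characters that actually occur in Medline.

```java
import java.text.Normalizer;

public class BmpMapper {
    /**
     * Replaces each supplementary code point by its NFKC form if that form
     * lies entirely within the BMP; otherwise the character is kept as is.
     */
    public static String toBmpWherePossible(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);   // BMP characters pass through untouched
                return;
            }
            String original = new String(Character.toChars(cp));
            String normalized = Normalizer.normalize(original, Normalizer.Form.NFKC);
            boolean allBmp = normalized.codePoints()
                    .noneMatch(Character::isSupplementaryCodePoint);
            sb.append(allBmp ? normalized : original);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A has a compatibility mapping to 'A'.
        System.out.println(toBmpWherePossible("\uD835\uDC00"));
    }
}
```

Characters without a BMP equivalent still have to be handled some other way, so this would only reduce the problem, not remove it.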

Of course, I welcome any other ideas. It would even be fine to replace the supplementary characters with some kind of placeholder and then put the original characters back in, but this seems error-prone. Better ideas?
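The placeholder idea could be sketched as follows. The `[[U+XXXXX]]` syntax, the class name and the method names are all invented for this example; the approach only works if that pattern never occurs in the real text, and any character offsets computed on the encoded text will differ from those in the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SupplementaryPlaceholders {
    // Hypothetical placeholder format: [[U+10400]] (5 or 6 hex digits).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\[\\[U\\+([0-9A-F]{5,6})\\]\\]");

    /** Replaces every supplementary code point with an ASCII placeholder. */
    public static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("[[U+").append(String.format("%X", cp)).append("]]");
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    /** Turns placeholders written by encode() back into the original characters. */
    public static String decode(String text) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            // Leave anything that is not a valid supplementary code point untouched.
            String replacement = Character.isSupplementaryCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "a\uD801\uDC00b";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

The text would be encoded before parsing and decoded again when values are read back out, which is exactly the extra bookkeeping that makes this feel fragile.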

Edit: Here is a (hopefully) minimal example of how this issue comes up with VTD-XML:

@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
    // character code point 0x10400
    String unicode = "<supplementary>\uD801\uDC00</supplementary>";
    // encode explicitly as UTF-8 instead of relying on the platform default charset
    byte[] unicodeBytes = unicode.getBytes("UTF-8");
    assertEquals(unicode, new String(unicodeBytes, "UTF-8"));

    VTDGen vg = new VTDGen();
    vg.setDoc(unicodeBytes);
    vg.parse(false);
    VTDNav vn = vg.getNav();
    long fragment = vn.getContentFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset + length), "UTF-8");
    String vtdString = vn.toRawString(offset, length);
    // this actually succeeds
    assertEquals("\uD801\uDC00", originalBytePortion);
    // this fails ;-( the returned character is Ѐ, code point 0x400, so the bits above the lower 16 of 0x10400 are lost
    assertEquals("\uD801\uDC00", vtdString);
}

How big a problem is it? I thought those characters were not used very often... — vtd-xml-author, Jan 23 '17 at 20:08
I might be able to get you a quick fix, I will have to look into this... — vtd-xml-author, Jan 23 '17 at 23:52
Hey Jimmy, thanks a lot for the quick reply. The issue is as follows: We read XML containing supplementary characters with VTD and then we store it. Up to here, everything goes fine. But when we then try to read the XML that we stored with VTD before, VTD complains about invalid characters. Of course: the former supplementary characters have been cut to 16 bits and are thus invalid. The original parsing did not throw errors but just truncated the characters. — khituras, Jan 24 '17 at 07:03
I added a fairly short example. VTD actually seems to get the character offsets right, as you can see when I just get them myself from the original data. But using ```toRawString(int)``` or ```toString(int)``` returns 0x0400 instead of 0x10400. Thanks a lot for looking into this, a fix would help me greatly! — khituras, Jan 25 '17 at 07:27
That's great! I would have time until tomorrow, if you think that would give you the time you need to fix more places. — khituras, Jan 25 '17 at 09:25
Thank you so much. The jUnit test is working now. I am currently deploying the new version to our processing pipelines. I will report how that worked in a day or so. If this issue really should be gone now this would be a major help for me, thanks! — khituras, Jan 26 '17 at 17:51
I may have a complete fixed version up soon... will do a 2.13_2 release — vtd-xml-author, Jan 26 '17 at 20:38
I did use the fixed version immediately and did a complete processing. I can now confirm that everything is working as it should, your fix was very helpful for us, thank you! — khituras, Feb 23 '17 at 14:34
Thanks for the confirmation. I can tell you there will be a lot more changes related to this bug that will come out... it will require a lot of testing... I will have to shamelessly ask you to check out those changes when they come out — vtd-xml-author, Feb 23 '17 at 22:13
