I think you are missing something important here. UTF-32 is an encoding for Unicode. Unicode actually fits within a 21 bit space. As the Unicode FAQ states:
"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."
Any UTF-32 "characters" that are outside of the Unicode code space are invalid ... and you should never see them in a UTF-32 encoded String. So 2^15 longs (2^21 bits / 64 bits per `long`) should be enough.
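For concreteness, here is a minimal sketch of such a bit-vector (the class and method names are my own, not from the linked Q&A): 2^21 bits packed into 2^15 longs, about 256 KiB.

```java
// Sketch: a bit-vector covering the full 21-bit Unicode code space.
// 2^21 bits / 64 bits per long = 2^15 longs.
public class CodePointSet {
    private final long[] bits = new long[1 << 15];

    public void add(int codePoint) {
        // Upper bits select the word; lower 6 bits select the bit.
        bits[codePoint >>> 6] |= 1L << (codePoint & 63);
    }

    public boolean contains(int codePoint) {
        return (bits[codePoint >>> 6] & (1L << (codePoint & 63))) != 0;
    }
}
```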
In practice, you are unlikely to see code points outside of the Basic Multilingual Plane (plane 0). So it makes sense to use a bitmap for the BMP (i.e. code points up to 65535) and a sparse data structure (e.g. a `HashSet<Integer>`) for the other planes.
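A hybrid along those lines might look like this sketch (again, the names are illustrative only): a dense 8 KiB bitmap for plane 0, and a `HashSet<Integer>` for the rarer supplementary code points.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: dense bitmap for the BMP, sparse set for the other planes.
public class HybridCodePointSet {
    private final long[] bmp = new long[65536 / 64];  // 8 KiB for plane 0
    private final Set<Integer> supplementary = new HashSet<>();

    public void add(int cp) {
        if (cp < 0x10000) {
            bmp[cp >>> 6] |= 1L << (cp & 63);
        } else {
            supplementary.add(cp);
        }
    }

    public boolean contains(int cp) {
        if (cp < 0x10000) {
            return (bmp[cp >>> 6] & (1L << (cp & 63))) != 0;
        }
        return supplementary.contains(cp);
    }
}
```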
You could also consider using `BitSet` instead of "rolling your own" bit-set data structure using `long` or `long[]`.
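For example, a `BitSet` sized for the whole code space does the word-and-bit bookkeeping for you (a sketch; the string is just sample input):

```java
import java.util.BitSet;

// Sketch: java.util.BitSet manages the underlying long[] for you.
public class BitSetDemo {
    public static void main(String[] args) {
        // One bit per possible code point (0 .. U+10FFFF).
        BitSet seen = new BitSet(0x110000);
        "hello\uD83D\uDE00".codePoints().forEach(seen::set);

        System.out.println(seen.get('h'));      // true
        System.out.println(seen.get(0x1F600));  // true: U+1F600 (an emoji)
        System.out.println(seen.cardinality()); // 5 distinct code points
    }
}
```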
Finally, I should note that some of the code in the Q&A that you linked to is NOT appropriate for looking for unique characters in UTF-16, for a couple of reasons:
The idea of using N variables of type `long` and a switch statement does not scale. The code of the switch statement gets large and unmanageable ... and possibly gets bigger than the JVM spec can cope with. (The maximum size of a compiled method is 2^16 - 1 bytes of bytecode, so it clearly isn't viable for implementing a bit-vector for all of the Unicode code space.)
It is a better idea to use an array of `long` and get rid of the need for a `switch` ... which is only really there because you have N distinct `long` variables.
In UTF-16, each code unit (a 16 bit value) encodes either one code point (character) or half of one; i.e. a single surrogate of a surrogate pair. If you simply create a bitmap of the code units, you won't detect unique characters properly.
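To illustrate the problem, a quick sketch: iterating with `codePoints()` treats a surrogate pair as one character, whereas counting code units (`length()`) does not.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: count unique characters by code point, not by code unit.
// U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
public class UniqueCodePoints {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B"; // 'A', U+1F600, 'B'

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // A bitmap over code units would wrongly record the two surrogates
        // as separate "characters"; iterate over code points instead.
        Set<Integer> unique = new HashSet<>();
        s.codePoints().forEach(unique::add);
        System.out.println(unique.size()); // 3 unique characters
    }
}
```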