
I have a question about the bit-vector approach that is commonly used to check whether a string has all unique characters. I have seen solutions out there (one of them) that work well for the ASCII and UTF-16 character sets.

However, how would the same approach work for UTF-32? The longest contiguous bit vector available in Java is a long variable, right? UTF-16 requires 1024 such variables, so taking the same approach for UTF-32 would require 2^26 long variables (I think). Is it possible to solve this for such a big character set using a bit vector?
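
To be concrete, this is the kind of ASCII-only check I mean (my own sketch; the method name is mine):

```java
// ASCII-only version: a 128-bit vector held in two longs.
static boolean allUniqueAscii(String s) {
    long lo = 0, hi = 0;                  // bits for chars 0-63 and 64-127
    for (int i = 0; i < s.length(); i++) {
        int c = s.charAt(i);              // assumes every char is < 128
        long mask = 1L << (c & 63);       // bit position within one long
        if (c < 64) {
            if ((lo & mask) != 0) return false;   // seen before
            lo |= mask;
        } else {
            if ((hi & mask) != 0) return false;
            hi |= mask;
        }
    }
    return true;
}
```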

neer
  • This question is very open ended; some code with a specific question would be better. I also disagree that UTF-16 and UTF-32 require a different number of descriptors/vectors to fully describe the UTF character set. – markspace Mar 15 '15 at 01:50
  • Within the context of your question, are these distinct: `a`, `a` followed by the Unicode combining "character" diaeresis, or `a` with a diaeresis ("aää", that's four codepoints)? – Tom Blodget Mar 15 '15 at 15:55
  • @TomBlodget Yes, sort of. I am not too sure what you mean by codepoints, but I am actually imagining this problem as comparing numerical byte values. So `a` and `ä` are not equal per se. – neer Mar 19 '15 at 06:20
  • A codepoint is an element of the "character" set; a number which may or may not have a "character" associated with it. Unicode has some codepoints termed "combining characters". A non-combining codepoint can be followed by any number of combining codepoints. Together, they form a grapheme. Unfortunately, Unicode has multiple representations of the same graphemes: "ä" vs "ä". Through normalization, you can convert "ä" (U+00E4) into "ä" (U+0061 U+0308) but then you'd have to account for that when comparing with "a" (U+0061). – Tom Blodget Mar 19 '15 at 14:28

2 Answers


I think you are missing something important here. UTF-32 is an encoding for Unicode, and Unicode actually fits within a 21-bit code space. As the Unicode FAQ states:

"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."

Any UTF-32 "characters" outside of the Unicode code space are invalid, and you should never see them in a UTF-32 encoded String. So 2^15 longs (2^21 bits / 64 bits per long) should be enough.

In practice, you are unlikely to see code points outside of the Basic Multilingual Plane (plane 0). So it makes sense to use a bitmap for the BMP (i.e. code points up to 65535) and a sparse data structure (e.g. a HashSet<Integer>) for the other planes.
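
Something along these lines (just a sketch, untested; the method name is mine):

```java
import java.util.HashSet;
import java.util.Set;

// Bitmap for the BMP (65536 bits = 1024 longs); sparse set for the
// supplementary planes.
static boolean allUnique(String s) {
    long[] bmp = new long[1024];
    Set<Integer> supplementary = new HashSet<>();
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);        // decodes surrogate pairs
        i += Character.charCount(cp);     // advance by 1 or 2 code units
        if (cp < 0x10000) {
            long mask = 1L << (cp & 63);
            if ((bmp[cp >>> 6] & mask) != 0) return false;
            bmp[cp >>> 6] |= mask;
        } else if (!supplementary.add(cp)) {
            return false;                 // add() returns false on a repeat
        }
    }
    return true;
}
```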

You could also consider using BitSet instead of rolling your own bit-set data structure using long or long[].
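
For instance (again just a sketch; the 0x110000 sizing is only a hint, since a java.util.BitSet grows on demand):

```java
import java.util.BitSet;

// One bit per Unicode code point (0 .. 0x10FFFF), roughly 136 KiB.
static boolean allUnique(String s) {
    BitSet seen = new BitSet(0x110000);
    return s.codePoints().allMatch(cp -> {
        if (seen.get(cp)) return false;   // repeated code point
        seen.set(cp);
        return true;
    });
}
```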


Finally, I should note that some of the code in the Q&A that you linked to is NOT appropriate for finding unique characters in UTF-16, for a couple of reasons:

  • The idea of using N variables of type long and a switch statement does not scale. The code of the switch statement gets large and unmanageable, and can exceed what the JVM spec allows. (The maximum size of a compiled method is 2^16 - 1 bytes of bytecode, so a switch-based bit vector for the whole Unicode code space clearly isn't viable.)

    It is a better idea to use an array of long and index into it, which gets rid of the need for a switch; the switch is only really there because you have N distinct long variables.

  • In UTF-16, each code unit (16-bit value) encodes either one whole code point (character) or half of one (a surrogate). If you simply create a bitmap of the code units, you won't detect unique characters properly; see the snippet after this list.
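
To illustrate the second point (a hypothetical example):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";   // U+1F600 (😀): one code point, two UTF-16 code units
        System.out.println(s.length());                       // 2 (code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (code point)
        // A bitmap keyed on charAt(i) records the two surrogate code units
        // separately, so "😀😁" (U+1F600 and U+1F601 share the high
        // surrogate 0xD83D) would wrongly look like it has a repeat.
    }
}
```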

Stephen C
  • Agreed. The [W3](http://www.w3.org/International/articles/definitions-characters/#unicode) page describes Unicode as a set encompassing all current and even ancient language characters, so many more than 65535 characters are likely. But I like your idea of a bitmap for the BMP and a HashSet for the others. – neer Mar 15 '15 at 07:17

Well, a long contains 64 bits of information, and the set of UTF-32 characters contains approximately 2^21 elements, which would require 2^21 bits of information. You would be right that it would take 2^26 long variables if UTF-32 used all 32 bits. As it is, though, you only need 2^21 / 64 = 2^15 long variables (still a lot).

If the characters were evenly distributed over that space, this inefficiency would be unavoidable, and the best solution would be something like a Set<Integer>. However, English plaintext tends to have the majority of its characters in the ASCII range (0-127), and most Western languages are fairly constrained to specific ranges, so you could use a bit vector for the high-frequency regions and a Set (or another order-independent structure with fast contains) for the remaining regions.
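
A sketch of that hybrid (untested; names are mine), with two longs covering the hot ASCII range and a HashSet for everything else:

```java
import java.util.HashSet;
import java.util.Set;

static boolean allUnique(String s) {
    long[] ascii = new long[2];           // 128 bits for the ASCII range
    Set<Integer> rest = new HashSet<>();  // everything above 127
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        i += Character.charCount(cp);
        if (cp < 128) {
            long mask = 1L << (cp & 63);
            if ((ascii[cp >>> 6] & mask) != 0) return false;
            ascii[cp >>> 6] |= mask;
        } else if (!rest.add(cp)) {
            return false;
        }
    }
    return true;
}
```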

k_g