
I'm writing a routine that saves large numbers to a file, but instead of writing the actual number as a string (e.g. 999999), I'd like to use its equivalent Unicode character (e.g. ), regardless of whether it actually corresponds to a visible or recognizable character. Excluding surrogate pairs, does anyone know which numerical values correspond to a SINGLE Unicode character? I'm asking this because I noticed that certain numerical values correspond to a two-character Unicode code point. For example, 999999 corresponds to , whereas 999998 corresponds to .
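
For illustration, a minimal sketch of the behaviour (this assumes a .NET string, i.e. UTF-16, and the class name is made up): code points from U+0000 to U+FFFF outside the surrogate range U+D800–U+DFFF fit in a single UTF-16 `char`, while anything above U+FFFF is stored as two `char`s, a surrogate pair, which is why a value such as 999999 shows up as two characters.

using System;

class SurrogateDemo
{
    static void Main()
    {
        // Code points U+0000..U+FFFF (outside U+D800..U+DFFF) are one UTF-16 code unit.
        Console.WriteLine(char.ConvertFromUtf32(0xFFFF).Length);  // 1
        // 999999 = U+F423F lies above U+FFFF, so it becomes a surrogate pair.
        Console.WriteLine(char.ConvertFromUtf32(999999).Length);  // 2
    }
}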

  • This sounds like an XY problem. Why treat the values as text at all if you don't care whether they're *really* text? Why not just write it all as a binary file? Handling arbitrary numbers as if they're text sounds like a recipe for problems. – Jon Skeet Nov 27 '19 at 09:34
  • Why pretend it's Unicode at all? If you're going to use Unicode code points (not characters, those aren't contiguous) merely for their value, you won't end up with characters at all. It's then little extra effort to drop the pretense of them being characters and just write the code points (using as many bits as your largest number requires). Surrogates are tricky enough without dragging them into problem areas where they don't even arise! – Jeroen Mostert Nov 27 '19 at 09:45
  • I agree with @JonSkeet that you should just use a `StreamWriter` to write the numbers in binary. When opening the file it will be represented as gibberish anyway. – Prophet Lamb Nov 27 '19 at 09:45
  • @RenéCarannante: No, `StreamWriter` is still for text. Did you mean `BinaryWriter`? – Jon Skeet Nov 27 '19 at 09:53
  • @JonSkeet aye sir – Prophet Lamb Nov 27 '19 at 09:55
  • Unicode consists of both one byte and two byte characters. The digits 0 to 9 are always one byte characters. I believe the hex values 0x00 to 0x7F are all single byte characters. – jdweng Nov 27 '19 at 10:07
  • @jdweng: what you're saying applies to UTF-8 (which maintains compatibility with ASCII). There are different ways of encoding characters; in UTF-16 the characters for `0` through `9` occupy two bytes when encoded. "Unicode" is a broad term that encompasses the whole standard, which often causes confusion between characters (the things we see), code points (the numbers used to represent them) and encodings (the way those numbers ultimately end up as bits). – Jeroen Mostert Nov 27 '19 at 10:21
  • The characters 0x00 to 0x7F are the same for all encodings. The encoding UTF8 (and others) maps the one byte 0x80 to 0xFF to two byte UNICODE characters to save memory. Once you get above 0xFF (two bytes) then everything is unicode. – jdweng Nov 27 '19 at 10:32
  • Does this answer your question? [Why Unicode is restricted to 0x10FFFF?](https://stackoverflow.com/questions/52203351/why-unicode-is-restricted-to-0x10ffff) – phuclv Nov 27 '19 at 12:06
  • @jdweng: the world would be a much better place if that was true, but it's not. These characters are the same for all encodings that have ASCII as a common subset, which is true for all common encodings, but not true for all encodings. Examples of encodings that aren't ASCII compatible include GSM 03.38 (the encoding used to transmit short mobile messages) and various flavors of EBCDIC (still in use on IBM mainframes). Even as simple as DOS code page 875 (for Greek) is not ASCII compatible. Calling everything that isn't ASCII "Unicode" is a source of needless confusion. – Jeroen Mostert Nov 27 '19 at 12:11
  • Even among (mostly) ASCII compatible encodings, there is frequently disagreement on what the first 32 code points are used for -- they're control characters, or nothing, or even filled with other characters entirely (like code page 437, which assigned graphics to many of them). – Jeroen Mostert Nov 27 '19 at 12:29
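
Picking up the suggestion from the comments above, a minimal sketch (assuming .NET; the file name and the sample values are made up) of writing the raw numeric values with `BinaryWriter` instead of encoding them as text:

using System.IO;

class BinaryDemo
{
    static void Main()
    {
        long[] values = { 999998, 999999 };

        // Each long is written as 8 raw little-endian bytes; no text encoding is involved.
        using (var writer = new BinaryWriter(File.Open("numbers.bin", FileMode.Create)))
        {
            foreach (long value in values)
                writer.Write(value);
        }
    }
}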

1 Answer


Unicode is currently defined to end at U+10FFFF (10_ffff₁₆ = 1_114_111₁₀). Some languages are able to relax that restriction, e.g. Perl:

#!/usr/bin/env perl
use Encode qw(encode);   # encode() is not a builtin; it comes from the Encode module

# A single "character" far beyond U+10FFFF:
my $str = "\x{7fff_ffff_ffff_ffff}";
# ÿ¿¿¿¿¿¿¿¿¿¿

# Perl's lax "UTF8" encoding turns it into 13 bytes:
my $bytes = encode "UTF8", $str;
# 0xff 0x80 0x87 0xbf 0xbf 0xbf 0xbf 0xbf 0xbf 0xbf 0xbf 0xbf 0xbf
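
By contrast, .NET (the environment implied by the question's comments) enforces the U+10FFFF ceiling; a minimal sketch:

using System;

class LimitDemo
{
    static void Main()
    {
        // U+10FFFF is the last valid code point and is accepted (as a surrogate pair)...
        Console.WriteLine(char.ConvertFromUtf32(0x10FFFF).Length);  // 2

        // ...but anything above it is rejected.
        try { char.ConvertFromUtf32(0x110000); }
        catch (ArgumentOutOfRangeException) { Console.WriteLine("out of range"); }
    }
}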
daxim
  • wrong. Unicode is limited to U+10ffff, not U+1fffff. [Why Unicode is restricted to 0x10FFFF?](https://stackoverflow.com/q/52203351/995714) – phuclv Nov 27 '19 at 12:06
  • @phuclv You can just edit an answer to improve it. Care to remove the nugatory downvote? – daxim Nov 27 '19 at 12:53