-2

I'd like to test whether a program recognizes Unicode characters and sorts them correctly.

Can anybody provide some examples of Unicode characters whose raw char representation sorts differently from the Unicode representation? Thanks.

user1424739
  • Sort order is not an inherent property of characters (except possibly in the sense of byte ordering). To sort Unicode, you need a Unicode collation algorithm. This means that there is no answer to your question, except with respect to whichever particular algorithm you're using, which is usually very language-specific. Even if there were such a definitive answer, resource requests are off-topic on SO. – Flimzy Sep 25 '19 at 14:55
  • Put another way: which characters are sorted differently with LC_COLLATE=C vs LC_COLLATE=UTF8 by coreutils `sort`? This is about software testing, so it should be relevant to SO. – user1424739 Sep 25 '19 at 15:00
  • That is indeed much more specific. I suggest updating your question with that. – Flimzy Sep 25 '19 at 15:02
  • What is "raw char representation"? The bytes in memory? In that case, it depends on the encoding form. In UTF-16LE, all characters are sorted differently from their raw byte encoding. – Mr Lister Sep 25 '19 at 17:29
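
The point about encoding forms can be checked with a small sketch. It uses only the Python standard library and sorts the same three strings by code point, by UTF-8 bytes, and by UTF-16LE bytes. The sample strings are chosen here for illustration and are not from the thread.

```python
# Sample strings (illustrative choice): U+0061, U+0100, U+007A
words = ["a", "Ā", "z"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Sorting by encoded bytes mimics a byte-wise ("raw") comparison.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16le = sorted(words, key=lambda s: s.encode("utf-16-le"))

print(by_code_point)  # ['a', 'z', 'Ā']
print(by_utf8)        # ['a', 'z', 'Ā']
print(by_utf16le)     # ['Ā', 'a', 'z']
```

UTF-8 byte order agrees with code point order, while UTF-16LE puts the low byte first, so "Ā" (bytes 00 01) sorts before "a" (bytes 61 00).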

1 Answer

-1
>>> from pyuca import Collator
>>> sorted(["cafe", "caff", "café"])
['cafe', 'caff', 'café']
>>> sorted(["cafe", "caff", "café"], key=Collator().sort_key)
['cafe', 'café', 'caff']
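
For building test data, it may help to list which pairs of strings two orderings disagree on. The following is a minimal sketch, not part of pyuca: `differing_pairs` is a hypothetical helper, and `str.casefold` is only a stand-in so the example runs without third-party packages. To compare against a real collation, `Collator().sort_key` from the session above could presumably be passed as the second key instead.

```python
from itertools import combinations


def differing_pairs(items, key_a, key_b):
    """Return pairs (x, y) that key_a and key_b put in opposite order.

    Hypothetical helper for illustration; pairs that either key
    treats as equal are skipped.
    """
    pairs = []
    for x, y in combinations(items, 2):
        ax, ay = key_a(x), key_a(y)
        bx, by = key_b(x), key_b(y)
        if ax == ay or bx == by:
            continue
        if (ax < ay) != (bx < by):
            pairs.append((x, y))
    return pairs


def code_point_key(s):
    # Plain str comparison in Python is by code point.
    return s


# Stand-in key (assumption): NOT a Unicode collation, just case-insensitive.
stand_in_key = str.casefold

samples = ["Zebra", "apple", "cafe", "caff"]
print(differing_pairs(samples, code_point_key, stand_in_key))
# [('Zebra', 'apple'), ('Zebra', 'cafe'), ('Zebra', 'caff')]
```

Each reported pair is a candidate test case: a program that sorts by raw code points will order it one way, and one that applies the second key will order it the other way.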
daxim