How do I compare Unicode strings containing non-English characters to sort alphabetically?

Question

I am trying to sort arrays/lists/whatever of data based on the Unicode string values in them, which contain non-English characters. I want them sorted correctly alphabetically.

I have written a lot of code (D2010, Win XP) which I thought was pretty solid for future internationalisation, but it is not. It all uses the UnicodeString (string) data type, and up until now I have only been putting English characters into the Unicode strings.

It seems I have to own up to making a very serious Unicode mistake. I talked to my German friend and tried out some German ß's (ß is 'ss' and should come after S and before T in the alphabet) and ö's etc. (note the umlaut), and none of my sorting algorithms work any more. The results are very mixed up. Garbage.

Since then I have been reading up extensively and have learnt a lot of unpleasant things about Unicode collation. Things are looking grim, much grimmer than I ever expected; I have seriously messed this up. I hope I am missing something and things are not actually as grim as they appear. I have been tinkering with Windows API calls (RtlCompareUnicodeString) with no success (protection faults); I could not get it to work. The problem with API calls, I learnt, is that they change across newer Windows platforms. Also, Delphi is going cross-platform soon, with Linux later, and my app is client-server, so I need to be concerned about this. But to be honest, with the situation being what it is (bad), I would be grateful for any forward progress, i.e. even something Win API specific.

Is using the Win API function RtlCompareUnicodeString the obvious solution? If so I should really try again with that, but to be honest I have been taken aback by all of the issues involved with Unicode collation, and I am not at all clear what I should be doing to compare these strings anyway.

I learnt of the IBM ICU C++ open-source project; there is a Delphi wrapper for it, albeit for an older version of ICU. It seems a very comprehensive solution which is platform independent. Surely I cannot be looking at creating a Delphi wrapper for this (or updating the existing one) to get a good solution for Unicode collation?

I would be extremely glad to hear advice at two levels :-

A) A Windows-specific, non-portable solution. I would be glad of that at the moment; forget the client-server ramifications!

B) A more portable solution which is immune to the various XP/Vista/Win7 variations of the Unicode API functions, therefore putting me in good stead for XE2 Mac support and future Linux support, not to mention the client-server complications.

Btw, I don't really want to be doing 'make-do' solutions, such as scanning strings prior to comparison and replacing certain tricky characters, which I have read about. I gave the German example above, but that's just an example. I want to get it working for all (or at least most: Far East, Russian) languages; I don't want to do workarounds for a specific language or two. I also do not need any advice on the sorting algorithms, they are fine. It's just the string comparison bit that's wrong.

I hope I am missing/doing something stupid, this all looks to be a headache.

Thank you.

EDIT, Rudy, here is how I was trying to call RtlCompareUnicodeString. Sorry for the delay I have been having a horrible time with this.

program Project26;

{$APPTYPE CONSOLE}

uses
  SysUtils;


var
  a,b:ansistring;

  k,l:string;
  x,y:widestring;
  r:integer;

procedure RtlInitUnicodeString(
  DestinationString:pstring;
  SourceString:pwidechar) stdcall; external 'NTDLL';

function RtlCompareUnicodeString(
  String1:pstring;
  String2:pstring;
  CaseInSensitive:boolean
  ):integer stdcall; external 'NTDLL';


begin

  x:='wef';
  y:='fsd';

  RtlInitUnicodeString(@k, pwidechar(x));
  RtlInitUnicodeString(@l, pwidechar(y));

  r:=RtlCompareUnicodeString(@k,@l,false);

  writeln(r);
  readln;

end.

I realise this is most likely wrong. I am not used to calling API functions directly; this is my best guess.
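For readers who hit the same protection faults: one likely cause is that both Rtl functions expect a pointer to a native UNICODE_STRING record rather than a pointer to a Delphi string, so RtlInitUnicodeString ends up writing a record over the string variable. Below is a minimal sketch under that assumption. The Delphi record name TNtUnicodeString is made up for illustration. Note too that this function compares raw code units, so even when called correctly it would not give locale-aware ordering.

```delphi
program RtlCompareSketch;

{$APPTYPE CONSOLE}

type
  // Mirrors the layout of the native UNICODE_STRING structure.
  // The Delphi type names here are hypothetical.
  TNtUnicodeString = record
    Length: Word;         // length in bytes, not characters
    MaximumLength: Word;  // buffer size in bytes
    Buffer: PWideChar;
  end;
  PNtUnicodeString = ^TNtUnicodeString;

procedure RtlInitUnicodeString(DestinationString: PNtUnicodeString;
  SourceString: PWideChar); stdcall; external 'ntdll.dll';

function RtlCompareUnicodeString(String1, String2: PNtUnicodeString;
  CaseInSensitive: Boolean): Integer; stdcall; external 'ntdll.dll';

var
  X, Y: WideString;
  K, L: TNtUnicodeString;
begin
  X := 'wef';
  Y := 'fsd';

  // Fill the records; they only point at X and Y, nothing is copied.
  RtlInitUnicodeString(@K, PWideChar(X));
  RtlInitUnicodeString(@L, PWideChar(Y));

  // Negative, zero or positive, based on code unit values only.
  Writeln(RtlCompareUnicodeString(@K, @L, False));
  Readln;
end.
```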

About your CompareStringEx API function: that looked really good, but it is available on Vista+ only, and I'm using XP. CompareString is on XP, but that's not Unicode!

To recap, the basic task afoot, is to compare two strings, and to do so based on the character sort order specified in the current windows locale.

Can anyone say for sure whether AnsiCompareText should do this or not? It doesn't work for me, but others have said it should, and other things I have read suggest it should.

This is what I get with 31 test strings when using AnsiCompareText when in German Locale (space delimited - no strings contain spaces) :-

arß Asß asß aßs no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

EDIT 2.

I am still keen to hear if I should expect AnsiCompareText to work using the locale info, as lkessler has said so, and lkessler has also posted about these subjects before and seems to have been through this before.

However, following on from Rudy's advice I have also been checking out CompareStringW - which shares the same documentation with CompareString, so it is NOT non-unicode as I have stated earlier.

Even if AnsiCompareText is not going to work, although I think it should, the Win32 API function CompareStringW should indeed work. Now I have defined my API function, and I can call it, and I get a result and no error... but I get the same result every time regardless of the input strings! It returns 1 every time, which means less than. Here's my code:

var
  k,l:string;

function CompareStringW(
  Locale:integer;
  dwCmpFlags:longword;
  lpString1:pstring;
  cchCount1:integer;
  lpString2:pstring;
  cchCount2:integer
  ):integer stdcall; external 'Kernel32.dll';

begin;

  k:='zzz';
  l:='xxx';

  writeln(length(k));
  r:=comparestringw(LOCALE_USER_DEFAULT,0,@k,3,@l,3);

  writeln(r); // result is 1=less than, 2=equal, 3=greater than
  readln;

end;

I feel I am getting somewhere now after much pain. Would be glad to know about AnsiCompareText, and what I am doing wrong with the above CompareStringW api call. Thank you.

EDIT 3

Firstly, I fixed the API call to CompareStringW myself: I was passing in @mystring when I should have passed PString(mystring). Now it all works correctly.

r:=comparestringw(LOCALE_USER_DEFAULT,0,pstring(k),-1,pstring(l),-1);
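For reference, here is a minimal sketch of the same call with the string parameters typed as PWideChar. This assumes the CompareStringW declaration that ships in Delphi's Windows unit, so no hand-written external declaration should be needed. The LocaleCompare wrapper name is made up for illustration.

```delphi
program CompareStringSketch;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: returns <0, 0 or >0, in the style of AnsiCompareStr.
function LocaleCompare(const A, B: string): Integer;
var
  R: Integer;
begin
  // PWideChar(A) is the address of the characters;
  // @A would be the address of the string variable itself.
  R := CompareStringW(LOCALE_USER_DEFAULT, 0,
    PWideChar(A), Length(A), PWideChar(B), Length(B));
  if R = 0 then
    RaiseLastOSError;  // 0 signals failure
  // CompareStringW returns 1 (less), 2 (equal) or 3 (greater).
  Result := R - 2;
end;

begin
  Writeln(LocaleCompare('zzz', 'xxx'));  // positive: 'zzz' sorts after 'xxx'
  Readln;
end.
```

Passing explicit lengths instead of -1 avoids relying on the terminating null, though both forms should behave the same for ordinary Delphi strings.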

Now, you can imagine my dismay when I still got the same sort result as I did right at the beginning...

arß asß aßs Asß no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

You may also imagine my EXTREME dismay, not to mention simultaneous joy, when I realised the sort order IS CORRECT, and IT WAS CORRECT RIGHT BACK AT THE BEGINNING! It makes me sick to say it, but there was never any problem in the first place. This is all down to my lack of German knowledge. I believed the sort was wrong, since you can see above that the strings start with S, then later they start with ß, then s again and back to ß and so on. I can't speak German, but I thought I could still clearly see that they were not sorted correctly, because my German friend told me ß comes after S and before T... I WAS WRONG! What is happening is that the string functions (both AnsiCompareText and the Win API CompareStringW) are SUBSTITUTING every 'ß' with 'ss', and every 'ö' with a normal 'o'... so if I take the results above and do a search and replace as described I get...

arss asss asss Asss no no o on oo oo ooo ooo op po ss SS ssass ssbss sss Sssa Sssb ssss ssss SSSS ssssss ssssss SSssss ssz sszss z zzz

Looks pretty correct to me! And it always was.

I am extremely grateful for all the advice given, and extremely sorry to have wasted your time like this. Those German ß's got me all confused; there was never anything wrong with the built-in Delphi function or anything else. It just looked like there was. I made the mistake of combining them with normal 's' in my test data; any other letter would not have created this illusion of un-sortedness! The squiggly ß's have made me look a fool! ßs!

Rudy and lkessler were both especially helpful, ty both. I have to accept lkessler's answer as most correct, sorry Rudy.

Thanks Ian. I just spent 4 days messing around with this! Next time I won't laugh quite so loud when I find my cat chasing its own tail.... – csharpdefector, Aug 14 '11 at 02:11
@csharpdefector: Glad you were able to figure it out. And thank you for your detailed question and followups, which will help others when they have the same question. Some of the answers to my questions on StackOverflow have told me that my understanding was wrong, and this correction of my thinking is even more valuable to me than simply getting my answer. And the wonderful thing with StackOverflow is, when you're totally stumped, you often get your answer within days or even hours. Fantastic. (Yes I'm an SO booster) — lkessler, Aug 14 '11 at 18:00

Rudy Velthuis · Answer 1 · 2011-08-13T16:51:30.373

You said you had problems calling Windows API functions yourself. Could you post the code, so people here can see why it failed? It is not as hard as it may seem, but it does require some care. ISTM that RtlCompareUnicodeString() is too low level.

I found a few solutions:

Non-portable

You could use the Windows API function CompareStringEx. This will compare using Unicode specific collation types. You can specify how you want this done (see link). It does require wide strings, i.e. PWideChar pointers to them. If you have problems calling it, give a holler and I'll try to add some demo code.
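A rough sketch of what such a call might look like, assuming the documented kernel32 signature of CompareStringEx and loading it dynamically so the program still starts on systems that lack it (the function exists on Vista and later only). The type name TCompareStringEx and the wrapper CompareEx are made up for illustration.

```delphi
program CompareStringExSketch;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

type
  // Hypothetical Delphi translation of the CompareStringEx prototype.
  TCompareStringEx = function(lpLocaleName: PWideChar; dwCmpFlags: DWORD;
    lpString1: PWideChar; cchCount1: Integer;
    lpString2: PWideChar; cchCount2: Integer;
    lpVersionInformation: Pointer; lpReserved: Pointer;
    Param: LPARAM): Integer; stdcall;

var
  CompareStringExProc: TCompareStringEx;

// Returns <0, 0 or >0 using the user's default locale.
function CompareEx(const A, B: string): Integer;
var
  R: Integer;
begin
  if not Assigned(CompareStringExProc) then
    raise Exception.Create('CompareStringEx is not available on this Windows version');
  // A nil locale name selects the user's default locale.
  R := CompareStringExProc(nil, 0, PWideChar(A), Length(A),
    PWideChar(B), Length(B), nil, nil, 0);
  if R = 0 then
    RaiseLastOSError;  // 0 signals failure
  Result := R - 2;     // map 1/2/3 to -1/0/+1
end;

begin
  // Look the function up at run time instead of importing it statically.
  @CompareStringExProc :=
    GetProcAddress(GetModuleHandle('kernel32.dll'), 'CompareStringEx');
  try
    Writeln(CompareEx('zzz', 'xxx'));
  except
    on E: Exception do
      Writeln(E.Message);
  end;
  Readln;
end.
```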

More or less portable

To make this more or less portable, you could write a function that compares two strings and use conditional defines to choose the different comparison APIs for the platform.
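As a sketch of that idea, a single comparison function can hide the platform choice behind conditional defines. The unit name PortableCompare and the function CollateCompare are hypothetical, and the non-Windows branch is only a placeholder that falls back to whatever the RTL's AnsiCompareStr does on that platform.

```delphi
unit PortableCompare;

interface

// Hypothetical wrapper: returns <0, 0 or >0 using locale-aware rules.
function CollateCompare(const A, B: string): Integer;

implementation

uses
  {$IFDEF MSWINDOWS}Windows,{$ENDIF}
  SysUtils;

function CollateCompare(const A, B: string): Integer;
{$IFDEF MSWINDOWS}
var
  R: Integer;
{$ENDIF}
begin
{$IFDEF MSWINDOWS}
  // Windows: ask the OS to collate using the user's locale.
  R := CompareStringW(LOCALE_USER_DEFAULT, 0,
    PWideChar(A), Length(A), PWideChar(B), Length(B));
  if R = 0 then
    RaiseLastOSError;
  Result := R - 2;
{$ELSE}
  // Placeholder for other platforms: defer to the RTL.
  Result := AnsiCompareStr(A, B);
{$ENDIF}
end;

end.
```

Sorting code then calls only CollateCompare, so swapping in another backend later (ICU, for example) would touch this one function.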

Those look like good suggestions Rudy, ty. Yes I will post the code, and try the above in about 90 mins, I have to go out now. — csharpdefector, Aug 13 '11 at 17:00

score 7 · Accepted Answer · edited May 23 '17 at 10:29

Try using CompareStr for case sensitive, or CompareText for case insensitive if you want your sorts exactly the same in any locale.

And use AnsiCompareStr for case sensitive, or AnsiCompareText for case insensitive if you want your sorts to be specific to the locale of the user.

See: "How can I get TStringList to sort differently in Delphi" for a lot more information on this.
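To illustrate the locale-specific option, here is a minimal sketch that sorts a TStringList through AnsiCompareText via CustomSort. The comparer name CompareLocaleText and the test words are made up. The resulting order depends on the user's locale, and a console window may not display the non-ASCII characters properly.

```delphi
program LocaleSortSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical comparer: case-insensitive, uses the user's locale.
function CompareLocaleText(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := AnsiCompareText(List[Index1], List[Index2]);
end;

var
  Words: TStringList;
  I: Integer;
begin
  Words := TStringList.Create;
  try
    Words.Add('z');
    Words.Add('s'#$00DF);  // 's' followed by sharp s
    Words.Add('ss');
    Words.Add(#$00F6'n');  // o-umlaut followed by 'n'
    Words.Add('no');

    Words.CustomSort(CompareLocaleText);

    for I := 0 to Words.Count - 1 do
      Writeln(Words[I]);
  finally
    Words.Free;
  end;
  Readln;
end.
```

Swapping AnsiCompareText for CompareText in the comparer gives the locale-independent ordering described in the first sentence of the answer.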

lkessler · answered Aug 13 '11 at 16:45

I am sure you are right in your comment to HeartWare. I expected AnsiCompareText to work in the first place. I just realised something, as I have the Windows Language Bar floating on screen: when I run my app the language bar suddenly reverts the user locale back to English, even though my user locale and my system locale (which I changed, then rebooted) are both German. I have no project settings that will change any codepage or locale. I suspect this is the problem. I noticed you had the same problem in a previous question you asked about internationalisation. Very frustrating! The user locale won't stay German! – csharpdefector Aug 13 '11 at 20:43
Forgot that I had English as default, so it went English any time I started a new app - whoops. I'm now in the German locale, and my ß's are still not getting sorted correctly. – csharpdefector Aug 13 '11 at 21:22

score 2 · Answer 3 · edited Oct 08 '14 at 17:07

In Unicode the numeric order of the characters is certainly not the sorting sequence. AnsiCompareText as mentioned by HeartWare does take locale specifics into consideration when comparing characters, but, as you found out, does nothing wrt the sorting order. What you are looking for is called the collation sequence of a language, which specifies the alphabetic sorting order for a language taking diacritics etc into consideration. They were sort of implied in the old Ansi Code pages, though those didn't account for sorting difference between languages using the same character set either.

I checked the D2010 docs. Apart from some TIB* components I didn't find any links. C++Builder does seem to have a compare function that takes collation into account, but that's not much use in Delphi. There you will probably have to use some Windows API functions directly.

Docs:

Sorting collate all out: http://www.siao2.com/2008/12/06/9181413.aspx
Collation terminology: http://msdn.microsoft.com/en-us/library/ms143726(SQL.90).aspx (though that pertains to MS SQL 2005, it may be helpful)

The 'Sorting "Collate" all out' article is by Michael Kaplan, someone who has great in-depth knowledge of all things Unicode and all intricacies of various languages. His blog has been invaluable to me when porting from D2006 to D2009.

score 1 · Answer 4 · answered Aug 13 '11 at 15:38

Have you tried AnsiCompareText ? Even though it is called "Ansi", I believe it calls on to an OS-specific Unicode-able comparison routine...

It should also make you safe from cross-platform dependencies (provided that Embarcadero supplies a compatible version in the various OS's they target).

I do not know how well the comparison works with the various strange Unicode ways to encode strings, but try it out and let us know the result...

HeartWare · answered Aug 13 '11 at 15:38

Yes I tried that with much hope, it's no good. A-Z fine but my ß's etc get sorted wrong :( – csharpdefector Aug 13 '11 at 15:48
With AnsiCompareText, if your locale is Germany your ß's should sort correctly, but they may not if your locale is anything else. – lkessler Aug 13 '11 at 17:25
