2

I have some columns in MYSQL table that will contain unique strings in different languages and there are scenarios where some characters will match e.g. English A with Swedish A and MYSQL will treat them as equal.

So I am considering using UTF8_bin collation for those columns as it seem to be more appropriate.

The problem is that with UTF_8_bin the unique indexes will not be case insensitive so having 'Andreas' and 'andreas' will be possible if both strings are of the same language.

Is there any workaround for this?

Solarflare
  • 8,861
  • 2
  • 14
  • 29
NullReference
  • 722
  • 1
  • 6
  • 19

1 Answers1

1

You can use a generated column that will store the lower case version of your string, and use a unique index on that.

create table test_utf8_bin_ci
( u8 varchar(50) charset utf8mb4 collate utf8mb4_unicode_ci,
  u8_bin_ci varchar(50) charset utf8mb4 collate utf8mb4_bin as (lower(u8)) unique
);

insert into test_utf8_bin_ci (u8)
values ('A'),('Ä'),('Å'),('Â'),('Á'),('À');

insert into test_utf8_bin_ci (u8)
values ('å');

Error Code: 1062. Duplicate entry 'å' for key 'u8_bin_ci'

For MySQL versions prior to 5.7.8, indexes on generated columns are not supported yet. So you would need to add a "normal" column (without the as (lower(u8))) and calculate the value in insert/update triggers. The unique index would work the same way as with calculated columns, it's just a little more to code.

create trigger trbins_test_u8_bin_ci before insert on test_u8_bin_ci
for each row 
  set new.u8_bin_ci = lower(new.u8);
create trigger trbupd_test_u8_bin_ci before update on test_u8_bin_ci
for each row 
  set new.u8_bin_ci = lower(new.u8);

If you use _bin, you should be aware that a lot of function will not work case insensitive anymore either, e.g.

select * 
from test_utf8_bin_ci 
where u8 = 'ä';

will not give you any results. To be able to use an index to search (which is not possible if you use e.g. where lower(u8) = lower('ä')), you can use

select * 
from test_utf8_bin_ci 
where u8_bin_ci = lower('ä');

That would mean that your queries would use different columns to compare and to update (which might require some further adaption if you use a framework), but workarounds wouldn't be called workarounds if they would be perfect.

Solarflare
  • 8,861
  • 2
  • 14
  • 29