1171

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

simhumileco
  • 21,911
  • 14
  • 106
  • 90
KahWee Teng
  • 12,350
  • 3
  • 19
  • 21
  • 2
    See also http://stackoverflow.com/questions/1036454/what-are-the-diffrences-between-utf8-general-ci-and-utf8-unicode-ci – unor Aug 28 '12 at 20:24
  • 8
    If you like `utf8[mb4]_unicode_ci`, you _may_ like `utf8[mb4]_unicode_520_ci` even more. – Rick James Jan 27 '16 at 05:52
  • 9
    I don't know how I feel about that - instead of fixing their implementation to follow the latest Unicode standard they keep the obsolete version as the default and people have to add "520" to use the proper one now. And it's not forwards and backwards compatible because you can't use the "520" version on older MySQL versions. Why couldn't they have just updated their existing collation? Same with "mb4", really. What code really depended on the old, limited/obsolete behaviour to justify keeping that as the default? – thomasrutter Sep 21 '17 at 23:41
  • 9
    Still better is 8.0's default of `utf8mb4_0900_ai_ci`. – Rick James Jan 05 '19 at 00:41
  • And 8.0 sped up utf8 comparisons significantly. (Probably all collations of utf8/utf8mb4) – Rick James Feb 07 '21 at 03:06

8 Answers8

1752

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci.

All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared.

_unicode_ci and _general_ci are two different sets of rules for sorting and comparing text according to the way we expect. Newer versions of MySQL introduce new sets of rules, too, such as _0900_ai_ci for equivalent rules based on Unicode 9.0 - and with no equivalent _general_ci variant. People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of those older collations below is provided for interest only.

MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 instead of utf8 for the character encoding part, to ensure you are getting the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.

Key differences

  • utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages.

  • utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed. It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations, such as when using particular languages or characters.

    On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (ie "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

  • For examples, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE" as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

  • Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.

In non-latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks than this.

In the past, some people recommended to use utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So, utf8mb4_general_ci is a compromise that's probably not needed for speed reasons and probably also not suitable for accuracy reasons.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

What the parts mean

Firstly, ci is for case-insensitive sorting and comparison. This means it's suitable for textual data, and case is not important. The other types of collation are cs (case-sensitive) for textual data where case is important, and bin, for where the encoding needs to match, bit for bit, which is suitable for fields which are really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results and case-sensitive comparison can result in duplicate values differing only in letter case, so case-sensitive collations are falling out of favor for textual data - if case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.

Next, unicode or general refers to the specific sorting and comparison rules - in particular, the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than one specific one. The differences between these two sets of rules are the subject of this answer. Note that unicode uses rules from Unicode 4.0. Recent versions of MySQL add the rulesets unicode_520 using rules from Unicode 5.2, and 0900 (dropping the "unicode_" part) using rules from Unicode 9.0.

And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode based encodings.

thomasrutter
  • 104,920
  • 24
  • 137
  • 160
  • 234
    @KahWeeTeng You should **never, *ever*** use `utf8_general_ci`: it simply doesn’t work. It’s a throwback to the bad old days of ASCII stooopeeedity from fifty years ago. Unicode case-insensitive matching cannot be done without the foldcase map from the UCD. For example, “Σίσυφος” has three different sigmas in it; or how the lowercase of “TSCHüẞ” is “tschüβ”, but the uppercase of “tschüβ” is “TSCHÜSS”. You can be right, or you can be fast. Therefore you must use `utf8_unicode_ci`, because if you don’t care about correctness, then it’s trivial to make it infinitely fast. – tchrist Mar 15 '12 at 19:24
  • 2
    “utf8_general_ci comes very close to correct Unicode sorting in many languages, but has a number of inaccuracies in some languages.”: is there also an impact on character classes, I mean in practice, does it impact things like `LTRIM`/`RTRIM`? – Hibou57 Aug 18 '13 at 13:57
  • 1
    That's a good question, that I can't give a definitive answer to off the top of my head. I would assume that LTRIM and RTRIM *would* work the same, as would any identification purely on character classes. The two collations have the same characters and character ranges. Though, I will repeat my general view that there's really no reason to use utf_general_ci anymore, as the reasons it was created are no longer relevant in all but the most obscure use cases. – thomasrutter Aug 19 '13 at 15:17
  • @thomasrutter Out of curiosity, how did you learn so much about SQL? – bigpotato Aug 14 '14 at 16:40
  • 1
    I don't know that much about SQL, though I have been using MySQL for many years and like to spend time getting to know the internals of things. – thomasrutter Aug 14 '14 at 23:58
  • save my day... i wondered why the orm i used keeps generate schema in unicode, now i know. thanks for great explanation – Marcel Djaman Dec 29 '14 at 07:34
  • 11
    After reading this I also uncovered that utf8_unicode_ci will consider any characters with the same collation weight as equal for purposes of equality comparison. This leads to cases where `"か" == "が"` or `"ǽ" == "æ"`. For sorting this makes sense but could be surprising when selecting via equalities or dealing with unique indices - https://bugs.mysql.com/bug.php?id=16526 – Mat Schaffer Mar 13 '15 at 15:22
  • i moved your update to top since that is the most recent and relevant part of the information to handle tl;dr use-case. – DevZer0 Jun 19 '15 at 03:28
  • I appreciate the idea, but the main question was what are the differences between them, and they got pushed down with your edit. My recommendation for what to use was just a follow-on from that. I will come up with a better edit which I hope you'll like better. – thomasrutter Jun 19 '15 at 03:33
  • Updated to refer to the character encoding as utf8mb4 instead of utf8, as introduced in MySQL ~5.5.3. They are the same encoding but utf8mb4 removes a MySQL limitation restricting it to the BMP (first 65,536 codepoints only). – thomasrutter Jul 14 '15 at 00:36
  • utf8mb4 is a creative choice, perhaps I can even say a ridiculous one. As a dev you ought to choose compatible stuff and fix problems instead of creating them. Namely, utfmb4 won't work on many Plesk servers as they don't come with MySQL version better or equal to 5.5.3 preinstalled. The principle of "good enough" applies here, so use utf8_unicode_ci as your collation of choice. – Dan Horvat Oct 11 '15 at 12:51
  • 5
    @DanHorvat The only practical reason to limit yourself to MySQL's older, more limited subset of Unicode is if you have an old version of MySQL that doesn't support the more complete utf8mb4. 5.5.3 is over 5 years old. I appreciate that Plesk runs on a different MySQL schedule, but most distros are on MySQL 5.5 now and Plesk 11.x *does* support MySQL 5.5 if you update its components. – thomasrutter Oct 11 '15 at 23:05
  • @thomasrutter Yes, that's exactly what I said. Unless there's a real practical reason for using utf8mb4 (the extra stuff is really vital to your project - you're a linguist or something), you should stick with what works for most environments. Everything else, e.g. using utf8mb4 because it's "the best" is actually bad practice. Whoever is doing it is a bad dev. OVH Plesk servers with CentOS 6.6 use MySQL 5.1.73. "Most people" also do not use old versions of IE, yet the good devs take old IE into account when designing websites. Anyway, I downvoted this answer. – Dan Horvat Oct 12 '15 at 08:26
  • 27
    I would disagree that using the newer, more standards-complaint variant is a bad practice, and I think it's inflammatory to call people bad developers over something like this. You may also want to note that my answer as it stands says "*in new versions* of MySQL use utf8mb4, rather than utf8", emphasis mine. – thomasrutter Oct 12 '15 at 14:13
  • 30
    @DanHorvat `utf8mb4` is [the only correct choice](https://stijndewitt.com/2015/06/15/use-mysql-utf8mb4-if-you-want-full-unicode-support/). With `utf8` you are stuck in some MySQL-only, 3-byte variant of UTF8 that only MySQL (and MariaDB) know what to do with. The rest of the world is using UTF8, which can contain [up to 4 bytes per character](https://stijndewitt.com/2014/08/09/max-bytes-in-a-utf-8-char/). The MySQL devs misnamed their homebrew encoding `utf8` and to not break backward compatibility, they now have to refer to the real UTF8 as `utf8mb4`. – Stijn de Witt Jun 14 '16 at 17:32
  • @StijndeWitt I don't disagree. But try doing a simple thing - migrate a site that was built with the most popular CMS (Wordpress) site to the second most popular panel (Plesk) on the most popular hosting provider (OVH) and see what happens. I don't think an engineering solution that doesn't work properly with the most popular products is a correct choice. Whatever the textbook says. The job of an engineer is to fix problems and not create them, as I stated above. – Dan Horvat Jun 15 '16 at 11:42
  • @ypercubeᵀᴹ no, `0xFFFD`. Old versions of MySQL didn't support characters outside the basic multilingual plane (BMP). – thomasrutter Oct 17 '16 at 03:20
  • Please have a look at your answer - it has been edited so that original references to `utf8` and `utf8mb4` have been mixed up. – rath Nov 08 '18 at 16:13
  • @rath I deliberately changed all instances of "utf8" to "utf8mb4". Since MySQL 5.5.3 (8 years ago now) there is no justification for using the buggy "utf8" variants, and "utf8" vs "utf8mb4" is not what's being discussed here, this is about "utf8mb4_general_ci" vs "utf8mb4_unicode_ci". I curse whoever made the choice to come up with the confusing name "utf8mb4" and not just fix the problem, but, well, I've had 8 years to come to terms with it. – thomasrutter Nov 09 '18 at 00:36
  • @thomasrutter I didn't even realize. Thank you for the clarification, you're right, these names are extremely confusing at first – rath Nov 09 '18 at 09:39
  • unrelated question, but will `ascii_bin` run slower than `ascii_general_ci`?? – oldboy Sep 29 '19 at 03:42
  • 2
    @BugWhisperer ascii_bin should run lightning fast. But you should use whatever is correct, not whatever is fast but incorrect. ascii_bin will almost never be correct if you are storing actual text, only for alphanumeric codes and things – thomasrutter Sep 30 '19 at 04:49
  • im using it purely for numbers in some cases and in other cases numbers and latin alphabet. so `bin` is faster than `general_ci`? – oldboy Oct 01 '19 at 05:48
  • I can't say specifically how they've optimised it, but bin should require less computation because it doesn't have to normalise case. You would normally choose based on whether you need case insensitivity. – thomasrutter Oct 07 '19 at 02:48
  • @tchrist searching for `Yilmaz` matches `Yılmaz` with collaction `utf8mb4_general_ci`, but it won't match it with collation `utf8_unicode_ci` for some reason. – Adam Dec 06 '19 at 12:44
  • Old thread but a question: If I have legacy databases (or even new ones) and the field is not being used to search or sort data, does it matter if the encoding is `utf8_general_ci` rather than updating to `utf8mb4_general_ci` – Steve Jun 15 '20 at 23:38
  • MySQL's "utf8" prevents use of characters outside the Basic Multilingual Plane such as emoji and mathematical alphanumeric characters. Converting the character set of the column would allow those. That's the main difference as otherwise the collation will be about the same (note there are newer collations than "*_unicode_ci") – thomasrutter Jul 07 '20 at 00:33
  • Changing `utf8_general_ci` to `utf8_unicode_ci` in a column used for storing user's session id made my stored procedure work. It was all fine in localhost but not on a live server. – yendrrek Nov 15 '20 at 22:51
182

I wanted to know what is the performance difference between using utf8_general_ci and utf8_unicode_ci, but I did not find any benchmarks listed on the internet, so I decided to create benchmarks myself.

I created a very simple table with 500,000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20) ;
  theloop: loop
    SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
    INSERT INTO test VALUES (i+1, random);
    SET i=i+1;
    IF i = 500000 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

Then I created the following stored procedures to benchmark simple SELECT, SELECT with LIKE, and sorting (SELECT with ORDER BY):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description = 'test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description LIKE '%test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
    ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
    SET i = i + 1;
    IF i = 10 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

In the stored procedures above utf8_general_ci collation is used, but of course during the tests I used both utf8_general_ci and utf8_unicode_ci.

I called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.

My results are:

benchmark_simple_select()

  • with utf8_general_ci: 9,957 ms
  • with utf8_unicode_ci: 10,271 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like()

  • with utf8_general_ci: 11,441 ms
  • with utf8_unicode_ci: 12,811 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by()

  • with utf8_general_ci: 11,944 ms
  • with utf8_unicode_ci: 12,887 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 7.9%.

Alessio Cantarella
  • 4,659
  • 3
  • 21
  • 29
nightcoder
  • 12,219
  • 16
  • 61
  • 72
  • 19
    Nice benchmark, thanks for sharing. I'm getting sensibly similar figures (MySQL v5.6.12 on Windows): 10%, 4%, 8%. I concur: the performance gain of `utf8_general_ci` is just too minimal to be worth using. – RandomSeed Sep 15 '13 at 12:58
  • 10
    1) But shouldn't this benchmark generate similar results for the two collation by definition? I mean `CONV(FLOOR(RAND() * 99999999999999), 20, 36)` generates only ASCII, and no Unicode characters to be processed by the algorithms of the collations. 2) `Description = 'test' COLLATE ...` and `Description LIKE 'test%' COLLATE ...` only process a single string ("test") at runtime, don't they? 3) In real apps, columns used in ordering would probably be indexed, and indexing speed on different collations with real non-ASCII text might differ. – Halil Özgür Oct 12 '14 at 23:29
  • 2
    @HalilÖzgür - your point is partially wrong. I guess it's not about the codepoint value to be outside ASCII (which general_ci would handle correctly), but about specific features, like treating umlauts written as "Uml*ea*ute" or some such subtleties. – Tomasz Gandor Apr 14 '15 at 12:50
50

This post describes it very nicely.

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results.

informatik01
  • 15,174
  • 9
  • 67
  • 100
Michael Madsen
  • 51,727
  • 6
  • 69
  • 80
10

See the mysql manual, Unicode Character Sets section:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

So to summarize, utf_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf_unicode_ci which should implement the entire standard. The general_ci set will be faster because there is less computation to do.

Dana the Sane
  • 13,774
  • 8
  • 51
  • 77
  • 18
    There is no such thing as “slightly less correct”. Correctness is a boolean characteristic; it does not admit modifiers of degree. Just use `utf8_unicode_ci` and pretend the buggy broken version doesn’t exist. – tchrist Mar 15 '12 at 19:27
  • 2
    I had problems getting 5.6.15 to take the collation_connection setting, and it turns out you have to pass it in the SET line like 'SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'. Credit goes to Mathias Bynens for the solution, here's his very useful guide: http://mathiasbynens.be/notes/mysql-utf8mb4 – Steve Hibbert Jan 26 '14 at 14:57
  • 5
    @tchrist The problem with saying correctness is boolean is it doesn't take into account situations that don't rely on absolute correctness. Your underlying point isn't invalid nor am I attempting to espouse the benefits of general_ci, but your general statement about correctness is easily disproven. I do it on a daily basis in my profession. Comedy aside, Stuart has a good point [here](https://www.youtube.com/watch?v=F_1zoX5Ax9U). – Anthony Nov 19 '15 at 01:41
  • 5
    With geolocation or game development we trade correctness with performance all the time. And of course correctness is a real number between `0` and `1`, not a bool. :) E.G. selecting geo points in a bounding box is an approximation of 'points nearby' which is not as good as calculating the distance between the point and the reference point and filtering on that. But *both* are an approximation and in fact, complete correctness is mostly not achievable. See the [coastline paradox](https://en.wikipedia.org/wiki/Coastline_paradox) and [IEEE 754](https://en.wikipedia.org/wiki/IEEE_floating_point) – Stijn de Witt Jun 14 '16 at 17:45
  • 4
    **TL;DR**: Please provide a program that prints the *correct* result for `1/3` – Stijn de Witt Jun 14 '16 at 17:47
  • As far as there being no such thing as "slightly less correct", I would have to disagree. If I'm holding a peach, and someone asks me what I'm holding, I can answer "a peach", which would be correct. I could also answer "a piece of fruit", which would be correct, but slightly less correct. I could answer "food", which would, again, be correct, but slightly less so. See how being overly anal can be obnoxious? I do. – Jeffrey Tackett Mar 31 '17 at 00:48
9

In brief words:

If you need better sorting order - use utf8_unicode_ci (this is the preferred method),

but if you utterly interested in performance - use utf8_general_ci, but know that it is a little outdated.

The differences in terms of performance are very slight.

simhumileco
  • 21,911
  • 14
  • 106
  • 90
7

Some details (PL)

As we can read here (Peter Gulutzan) there is difference on sorting/comparing polish letter "Ł" (L with stroke - html esc: Ł) (lower case: "ł" - html esc: ł) - we have following assumption:

utf8_polish_ci      Ł greater than L and less than M
utf8_unicode_ci     Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci     Ł greater than Z

In polish language letter Ł is after letter L and before M. No one of this coding is better or worse - it depends of your needs.

Kamil Kiełczewski
  • 53,729
  • 20
  • 259
  • 241
2

There are two big difference the sorting and the character matching:

Sorting:

  • utf8mb4_general_ci removes all accents and sorts one by one which may create incorrect sort results.
  • utf8mb4_unicode_ci sorts accurate.

Character Matching

They match characters differently.

For example, in utf8mb4_unicode_ci you have i != ı, but in utf8mb4_general_ci it holds ı=i.

For example, imagine you have a row with name="Yılmaz". Then

select id from users where name='Yilmaz';

would return the row if collocation is utf8mb4_general_ci, but if it is collocated with utf8mb4_unicode_ci it would not return the row!

On the other hand we have that a=ª and ß=ss in utf8mb4_unicode_ci which is not the case in utf8mb4_general_ci. So imagine you have a row with name="ªßi", then

select id from users where name='assi';

would return the row if collocation is utf8mb4_unicode_ci, but would not return a row if collocation is set to utf8mb4_general_ci.

A full list of matches for each collocation may be found here.

Adam
  • 15,564
  • 13
  • 95
  • 165
0

According to this post, there is a considerably large performance benefit on MySQL 5.7 when using utf8mb4_general_ci in stead of utf8mb4_unicode_ci: https://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact-on-mysql-performance/

DavidH
  • 49
  • 2