Thursday, November 22, 2012

Character sets, Collations, UTF-8 and all that

Yesterday at the first Swedish MySQL User Group real meeting here in Stockholm, I presented a talk on character sets, collations and stuff like that. If you read this blog, you know that I have written about this before, but the presentation I did yesterday was a fair bit more detailed. You can view the full presentation on slideshare:
 One thing I talked a lot on was collations and how they affect matters, and this has more of an impact than you think, in particular when using UTF-8. You would think that using UTF-8 most character set problems are solved (at least when using 4-byte UTF-8), but no. Collations are still added to this, and there are many of them and the effect of choosing the wrong one can be real bad.

Let me take an example. You would think think that using a UNIQUE or PRIMARY KEY on a text-based column (using something like a VARCHAR or CHAR type) in a table would ensure that any two strings are unique, but that two strings values that are different may coexist in two different rows. Think again.

A collation defines how characters in a character set are sorted and compared. And most localized collations have some weird attributes to them. There are things that linguistics think are reasonable for a particular language, and that are hence present in the UNICODE standard, but it might not be widely accepted by the community at large. So back to my original example. Let's say we are in Sweden, then 4 (yes, four) different collations may be applicable:
  • utf8 binary - This is a plain binary collation, comparisons are done on the binary value of the characters.
  • utf8_unicode - This is a pretty reasonable collations based on some generic compromise in UNICODE on how things are sorted, and are not sorted across the globe. Sort of.
  • utf8_general - This is a simplified, faster general variation compared to utf8_Unicode
  • utf8_swedish - This is a collations that is specific to Sweden with some interesting Swedish specifics.
So lets's see how this work in practice. Lets try a table that looks like this:
CREATE TABLE `utf8_table` (
  `swedishname` char(10) CHARACTER SET utf8

    COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;

What happens with this data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
In Sweden, these two are unique (the second A has an umlaut). In the rest of the world, these two are the same, so the above will not work, a PRIMARY KEY error will happen on the second row, despite the characters being different! So we try this instead:
CREATE TABLE `utf8_table` (
  `swedishname` char(10) CHARACTER SET utf8

    COLLATE utf8_swedish_ci NOT NULL,
  PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;

And with the same data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
And this works as it should work, both rows are inserted!

I will write another blog post on this soon, with some more examples eventually, but for now:
Cheers
/Karlsson
PS. I apologize if you have problems reading the above, it probably comes from the fact that there are embedded Swedish characters in the text :-(

3 comments:

Daniël van Eeden said...

The use of the Byte Order Mark (BOM) is not explained yet. The BOM can be inserted in the middle of the data, which is problematic for some software.

And most UTF encodings have a BE and LE version.


A B C D E F G H
8 ♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
7 ♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
6
5
4
3
2 ♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
1 ♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

Robert Hodges said...

Interesting post! It seems like a bug or at best dubious behavior that primary key uniqueness is evaluated based on the collation sequence rather than the character values of the key. This behavior conflates sorting with value, which does not seem correct.

Just my $0.02.

Anders Karlsson said...

Daniel!

Yes, I never goy to the BOM. I talked a bit about it, but it was not in the preso. Also, as long as you are on UTF-8, there is no need for it, only with UTF-16, UCS-2 and UTF-32 is this required (yet another reason to stay clear of those). I know Microsoft ignores this and sometimes puts a BOM in UTF-8 data also, but that is just plain wrong.

Robert:
No, this is not a bug at all. As the collations determins equality, then these should also be used to enforce uniqueness. As the strings Äman, Åman and Aman are considered the same, unless you have the swedish collation, then a PK contraint should not allow all these in a PK.
If you really want a behavior that ask for, use the utf8_bin collation.

/Karlsson