Home > Software engineering >  What is the difference between utf8mb4_0900_as_cs and utf8mb4_0900_bin?
What is the difference between utf8mb4_0900_as_cs and utf8mb4_0900_bin?

Time:08-22

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights says

For all Unicode collations except the _bin (binary) collations, MySQL performs a table lookup to find a character's collating weight.

For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.

For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.

How does utf8mb4_0900_as_cs fit into this? The page does not mention it.

CodePudding user response:

For the short answer, I quote your own quote:

For all Unicode collations except the _bin (binary) collations, MySQL performs a table lookup to find a character's collating weight.

utf8mb4_0900_as_cs is not _bin. So it does the same as all other collations.

For the slightly longer answer, I assume that you assume that _as_cs and _bin behave the same as they both are case sensitive and care about accents.

That is not the case. For example, even if umlauts are not treated as equal in _as_cs (just as in _bin), they are still not ordered by their (almost random) byte value, but based on what they are an umlaut of (or rather, as specified in the unicode standard).

E.g. for _as_cs, you get

A < Á < À < Ä < AE < Æ < B

while for _bin, you get

A < AE < B < À < Á < Ä < Æ

To be able to do this, MySQL uses the same techniques as for _ai_ci, you just get a different position than in _as_cs. Only for _bin, MySQL can simply take the byte value as is.

  • Related