Home > Software design >  MySQL comparison between 'æ' and 'ae'
MySQL comparison between 'æ' and 'ae'

Time:03-02

My MySQL server doesn't recognize the difference between characters 'æ' and 'ae' while storing data to database and that creates some problems for me. My goal is to find a charset which recognizes the difference between those characters, and I found it (utfmb3), but it is going to be deprecated, and the new alternative (utfmb4) doesn't recognize those characters as different.

What I've tried:

set names 'utf8mb3';
select 'æ' = 'ae';

This select returns 0 (false), which means this charset sees these as different characters, and that's just what I need, but MySQL gives me a warning: 'utf8mb3' is deprecated and will be removed in a future release. Please use utf8mb4 instead

But when I do

set names 'utf8mb4';
select 'æ' = 'ae';

This select returns 1, which means utf8mb4 sees these as the same characters, which is not good..

So, my dilema is, what charset to use? If I use utfmb3, it will be deprecated soon, that's no good. If I use utfmb4, that won't work correctly.

CodePudding user response:

= and LIKE comparisons in WHERE clauses apply a collation (not just a character set) to determine this kind of equality. This statement returns zero for the first two collations and one for the second two.

SELECT 'æ' = 'ae' COLLATE utf8mb4_unicode_ci,       -- 0
       'æ' = 'ae' COLLATE utf8mb4_general_ci,       -- 0
       'æ' = 'ae' COLLATE utf8mb4_unicode_520_ci,   -- 1
       'æ' = 'ae' COLLATE utf8mb4_german2_ci        -- 1

It seems likely your default collation is one of the last two or some other collation that handles that equality test the way you don't want it.

You can see your connection's collation setting with this statement. I suspect it is utf8mb4_unicode_520_ci.

SELECT @@collation_connection;

Be sure to define the collation for your columns with one you do want, and set your connection collation to the same thing. utf8mb4_unicode_ci is suitable. Try this.

SET collation_connection = 'utf8mb4_unicode_ci';
SELECT 'æ' = 'ae'   -- 0;

It's hard to give more specific advice without understanding your linguistic requirements better.

More info here: Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?

CodePudding user response:

Coalition 'utf8mb4_unicode_ci' is the current one you want to use. Make sure you're setting your client (ie php, node. python) to use the correct charset as well (both in the db client object and the environment config).

  • Related