mysql compares accented vs unaccented characters as the same?-CodePudding

This does not make sense to me. Can anyone explain it? I think the column values should be different, so

select * from a1 where f1 = f2;

should find no rows. But...

mysql> create table a1 (f1 varchar(63), f2 varchar(63));
Query OK, 0 rows affected (0.00 sec)

mysql> show create table a1 \G
*************************** 1. row ***************************
       Table: a1
Create Table: CREATE TABLE `a1` (
  `f1` varchar(63) DEFAULT NULL,
  `f2` varchar(63) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> 
mysql> insert into a1 values ('EFBBBFD187D0B5D0BBD0BED0B2D0B5D0BA', 'EFBBBFD187D0B5D0BBD0BED0B2D0B5CC81D0BA');
Query OK, 1 row affected (0.02 sec)

mysql> update a1 set f1 = unhex(f1);
Query OK, 1 row affected (0.02 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> update a1 set f2 = unhex(f2);
Query OK, 1 row affected (0.02 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from a1;
 ------------------- --------------------- 
| f1                | f2                  |
 ------------------- --------------------- 
| человек          | челове́к           |
 ------------------- --------------------- 
1 row in set (0.00 sec)

mysql> 
mysql> 
mysql> select * from a1 where f1 = f2;
 ------------------- --------------------- 
| f1                | f2                  |
 ------------------- --------------------- 
| человек          | челове́к           |
 ------------------- --------------------- 
1 row in set (0.00 sec)

mysql> select * from a1 where hex(f1) = hex(f2);
Empty set (0.00 sec)

mysql>

CodePudding user response：

Character equivalence is defined by the collation used by the columns in question. A collation defines every pair of characters as equal, less than, or greater than, and this is used for comparisons and for sorting.

Your table uses utf8mb4_0900_ai_ci as the default collation, and this applies to all the columns, since they do not define a collation to override the table's default.

It's pretty common for collations to treat accented characters as equal to their unaccented versions.

If you want to choose a different collation, you may.

CodePudding user response：

The 3 bytes on the beginning, EFBBBF, is "BOM", which indicates that the text is UTF-8-encoded.

The rest look like Cyrillic челове́к, except for the "CC81 -- NSM COMBINING ACUTE ACCENT"

Some collations, including utf8mb4_0900_ai_ci, handle "combining accents", some do not. The "ai" means "accent insensitive".

I would understand this equivalence for a "latin" e. I don't know the rules for a CYRILLIC SMALL LETTER IE, which looks the same е, but is encoded differently.

You might want COLLATE utf8mb4_0900_as_ci, which is "accent sensitive and case insensitive".