This does not make sense to me. Can anyone explain it? I think the column values should be different, so
select * from a1 where f1 = f2;
should find no rows. But...
mysql> create table a1 (f1 varchar(63), f2 varchar(63));
Query OK, 0 rows affected (0.00 sec)
mysql> show create table a1 \G
*************************** 1. row ***************************
Table: a1
Create Table: CREATE TABLE `a1` (
`f1` varchar(63) DEFAULT NULL,
`f2` varchar(63) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql>
mysql> insert into a1 values ('EFBBBFD187D0B5D0BBD0BED0B2D0B5D0BA', 'EFBBBFD187D0B5D0BBD0BED0B2D0B5CC81D0BA');
Query OK, 1 row affected (0.02 sec)
mysql> update a1 set f1 = unhex(f1);
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> update a1 set f2 = unhex(f2);
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from a1;
------------------- ---------------------
| f1 | f2 |
------------------- ---------------------
| человек | челове́к |
------------------- ---------------------
1 row in set (0.00 sec)
mysql>
mysql>
mysql> select * from a1 where f1 = f2;
------------------- ---------------------
| f1 | f2 |
------------------- ---------------------
| человек | челове́к |
------------------- ---------------------
1 row in set (0.00 sec)
mysql> select * from a1 where hex(f1) = hex(f2);
Empty set (0.00 sec)
mysql>
CodePudding user response:
Character equivalence is defined by the collation used by the columns in question. A collation defines every pair of characters as equal, less than, or greater than, and this is used for comparisons and for sorting.
Your table uses utf8mb4_0900_ai_ci
as the default collation, and this applies to all the columns, since they do not define a collation to override the table's default.
It's pretty common for collations to treat accented characters as equal to their unaccented versions.
If you want to choose a different collation, you may.
CodePudding user response:
The 3 bytes on the beginning, EFBBBF
, is "BOM", which indicates that the text is UTF-8-encoded.
The rest look like Cyrillic челове́к
, except for the "CC81 -- NSM COMBINING ACUTE ACCENT"
Some collations, including utf8mb4_0900_ai_ci
, handle "combining accents", some do not. The "ai" means "accent insensitive".
I would understand this equivalence for a "latin" e
. I don't know the rules for a CYRILLIC SMALL LETTER IE, which looks the same е
, but is encoded differently.
You might want COLLATE utf8mb4_0900_as_ci
, which is "accent sensitive and case insensitive".