Home > Back-end >  Are there concerns if we have a database with mix of character set and collations?
Are there concerns if we have a database with mix of character set and collations?

Time:04-29

we are currently in the process of migrating latin1 databases to UTF8 in a MySQL Server.

We are investigating what is the best approach and hopefully without any downtime.

My question is if we will migrate a table to UTF8 in a latin1 database and for example, we have a query that performs actions on these two tables (let's say join for example)

is there any chance we will start getting not expected results because of the different collations/characters set?

I guess migrating the entire database will be safer but will involve downtime.

thank you.

CodePudding user response:

Yes and no.

Separate columns in a table can have different charsets and collations. Separate rows cannot.

Tables are independent until you JOIN. At that point, performance suffers if the column(s) you are Joining do not have the same charset and collation.

When inserting/fetching data, your client has a single charset. If that is utf8mb4, then MySQL will happily convert characters on the fly between that and what is in the table columns. Of course, some characters cannot be converted; for example, latin1 has far fewer distinct characters than utf8mb4.

You mentioned migration. You will need to either dump and reload (slower) or do lots of ALTERs (faster, but not "fast" for huge tables) or do something else. Please state the parameters you can live with. There are tradeoffs among speed, simplicity, etc.

A lot of people get in trouble when changing charsets. There are about 7 different ALTERs that one is tempered to use. If you perform the wrong one, it makes a mess that is really hard to unravel.

If you try to keep the database "live" while changing it, you could hit the JOIN performance problem mentioned above.

If you already have replication set up, there may be some techniques that take advantage of a "failover".

Keep the following Q&A at hand in case you encounter some gibberish: Trouble with UTF-8 characters; what I see is not what I stored

What version of MySQL are you using? If it 8.0, then default charset (utf8mb4) and its default collation are probably best.

CodePudding user response:

You should not get unexpected results.

If the character sets are totally incompatible, then you can't do the string comparison at all. You'll get an error.

If the character sets are compatible, but the collations are different, you should get the same result, but it will not be optimized with an index lookup. You can verify this with EXPLAIN, where you'll see the joined table uses a table-scan and may have "Using join buffer" in the extra column.

To solve the downtime problem, I've used pt-online-schema-change to perform ALTER TABLE <name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; (supposing you are using MySQL 8.0, that's a good choice for charset/collation), while continuing to read and write the original table. We did this (or other ALTER TABLE changes) many times per week, even during peak hours.

  • Related