Home > OS >  Does a surrogate (INT) key almost always yield better performance than an unique natural (VARCHAR) k
Does a surrogate (INT) key almost always yield better performance than an unique natural (VARCHAR) k

Time:04-24

I am struggling to understand what datatype to use for an MySQL database table.

Let's say we book publishing company and we need to create a database of all our books and authors in MySQL database. We have around 500000 books. A book has an unique ISBN (for example 978-3-16-148410-0).

So we have two options here to store our books:

  1. Create a id VARCHAR(24) NOT NULL natural primary key column and store our ISBNs there, or
  2. Create a surrogate id INT NOT NULL AUTO_INCREMENT but then also a isbn UNIQUE VARCHAR(24) column

What from I understand, the general consensus is not to use VARCHAR(n) as a primary keys as it takes more storage and performance doing lookups and joins, and generally that makes sense for me.

However, if all our operations will be towards ISBNs (SELECT * FROM books WHERE isbn = ?, UPDATE, DELETE, etc) - Why not use the VARCHAR(24) as the primary key?

I am struggling to understand that if you have an immutable natural key (like a book's ISBN) and 95% of all database operations require using that field regardless, shouldn't using a VARCHAR(24) always outperform a surrogate key design?

I feel like having a surrogate AUTO_INCREMENT INT key here, is just totally meaningless. It does not give any benefits.

Or am I missing something fundamental here when it comes to determining primary keys.

CodePudding user response:

I would use the ISBN as the primary key.

Primary key lookups in MySQL's default storage engine InnoDB are more efficient than lookups by secondary index.

It's true an integer takes less storage space than a 24-character varchar, but in your case, I assume you have to store the ISBN anyway. If you could use an integer instead of the ISBN, that would save storage.

The comment above that natural keys tend to violate uniqueness is a good warning in general. The violations usually come from the marketing department. ;-)

But for a given dataset, you can be sure that the natural key is free of duplicates. If you do experience an error reading the ISBN in your library collection, the librarian will have to resolve that manually. But I don't expect that to happen very often for 500,000 books.

Tip: Define the varchar with a binary collation, and it'll be a bit faster to do string comparisons. For example:

CREATE TABLE Books (
 isbn varchar(24) COLLATE utf8mb4_bin,
 -- ...other columns...
 PRIMARY KEY (isbn)
) DEFAULT CHARSET=utf8mb4;
  • Related