I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:
SELECT
CONVERT(VARBINARY(32),
CONVERT( CHAR(32),
HASHBYTES('MD5',
(LTRIM(RTRIM(COALESCE(column1,''))) ';' LTRIM(RTRIM(COALESCE(column2,''))))
),
2)
)
FROM database.schema.table
I find it hard to understand for what is the result from hashbytes function is converted to char and then to varbinary, when we get directly varbinary from hashbytes function. Is there any good reason to do so?
CodePudding user response:
Short Version
This code pads a hash with 0x20
bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anyway
Long Version
Hashes are completely inappropriate for generating primary keys. In fact, since the same hash can be generated from different original data, this code is guaranteed to produce duplicate values, causing collisions at best.
Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.
And even worse, the concatenation expression :
(LTRIM(RTRIM(COALESCE(column1,''))) ';' LTRIM(RTRIM(COALESCE(column2,''))))
Will create the same initial string for multiple different column values.
On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.
This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.
Using a GUID generated with NEWID()
would provide a key value without collisions. Using NEWSEQUENTIALID()
would generate sequential GUID values eliminating fragmentation and once again allowing range searches.
An even better solution would be to just create a PK from the two columns :
ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);
Or just add an IDENTITY-generated ID column. A bigint
is large enough to handle all scenarios :
Create ThatTable (
ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
...
)
If the intention was to ignore spaces in column values there are better options:
- The easiest solution would be to clean up the values when inserting them.
- A
CHECK
constraint can be added to each column to ensure the columns can't have leading or trailing spaces. - An
INSTEAD OF
trigger can be used to trim them. - Computed, persisted columns can be added that trim the originals, eg
Column1_Cleaned as TRIM(Column1) PERSISTED
. Persisted columns can be used in indexes and primary keys
As for what it does:
- It generates deprecation warnings (
MD5
is deprecated) - It pads the MD5 hash with
0x20
bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.
You can check the results by hashing any value. The following queries
select hashbytes('md5','banana')
----------------------------------
0x72B302BF297A228A75730123EFEF7C41
select cast(hashbytes('md5','banana') as char(32))
--------------------------------
r³¿)z"Šus#ïï|A
A space in ASCII is the byte 0x20
. Casting to binary replaces spaces with 0x20
, not 0x00
select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020
If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00
. The result is no better than the original though
select cast(hashbytes('md5','banana') as binary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000
To get a real 32-byte hash, SHA2_256
can be used :
select hashbytes('sha2_256','banana')
------------------------------------------------------------------
0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E