I am comparing hash values between Azure Databricks and SQL Server but get some unexpected results for non-ASCII characters. I have the problem narrowed down the the 'em dash' character (https://www.fileformat.info/info/unicode/char/2014/index.htm).
In T-SQL the character can be represented by NCHAR(8212)
, this is confirmed by
CASE WHEN '—' = NCHAR(8212) THEN 1 ELSE 0 END
Hashing the character using the HASHBYTES
function with algorithm='SHA2_512'
CONVERT(VARCHAR(128),HASHBYTES('SHA2_512', CONVERT(VARCHAR(MAX), NCHAR(8212))), 2)
)
results in the value:
CABC0F199E27BC609E318E283A7CC3882599B1538EE7794082FD17AA6F4B32A3FF6B7169D87CA21D0DFD122C4D4D54E4CC40BF1F940489A2561C725A362B8A80
.
Following the same process in Azure Databricks using the string representation '\u2014'
. And confirming this by
CASE WHEN '—' = '\u2014' THEN 1 ELSE 0 END
Applying the SHA2
function to the character with bitLength=512
SHA2('\u2014', 512)
results in
48bf9ce5ab3755bf779032043c2ff6b0b075ec3407b3e473872ea2b059165a86bd70e0854ff6e9d05f1677a77fefcce7c6e3d73d1f3d9dd6488a2fecf041bb89
Pasting the character in online editors, such as https://emn178.github.io/online-tools/sha512.html, https://www.conversion-tool.com/sha512/, https://codebeautify.org/sha512-hash-generator and https://sha512.online/, results consistently in
48bf9ce5ab3755bf779032043c2ff6b0b075ec3407b3e473872ea2b059165a86bd70e0854ff6e9d05f1677a77fefcce7c6e3d73d1f3d9dd6488a2fecf041bb89
The same value as in Azure Databricks.
This makes me believe that the issue is with SQL Server. Are there any suggestions on how to make SQL Server hash this character 'correctly'?
CodePudding user response:
The hash value is based on the binary representation of the string.
The following
SELECT
CONVERT(VARCHAR(128),
HASHBYTES('SHA2_512', CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8))
, 2)
Returns the hash value you are after (using a UTF8 collation).
SELECT HASHBYTES('SHA2_512', 0xE28094)
Some example results (showing method A yields the desired binary result to hash)
SELECT
A = CONVERT(VARBINARY(MAX), CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8)),
B = CONVERT(VARBINARY(MAX), CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS)),
C = CONVERT(VARBINARY(MAX), NCHAR(8212))
Returns
A | B | C |
---|---|---|
0xE28094 | 0x97 | 0x1420 |