Home > Enterprise >  How to correctly hash 'em dash' character in SQL Server?
How to correctly hash 'em dash' character in SQL Server?

Time:01-12

I am comparing hash values between Azure Databricks and SQL Server but get some unexpected results for non-ASCII characters. I have the problem narrowed down the the 'em dash' character (https://www.fileformat.info/info/unicode/char/2014/index.htm).

In T-SQL the character can be represented by NCHAR(8212), this is confirmed by

CASE WHEN '—' = NCHAR(8212) THEN 1 ELSE 0 END

Hashing the character using the HASHBYTES function with algorithm='SHA2_512'

CONVERT(VARCHAR(128),HASHBYTES('SHA2_512', CONVERT(VARCHAR(MAX), NCHAR(8212))), 2))

results in the value:

CABC0F199E27BC609E318E283A7CC3882599B1538EE7794082FD17AA6F4B32A3FF6B7169D87CA21D0DFD122C4D4D54E4CC40BF1F940489A2561C725A362B8A80.

Following the same process in Azure Databricks using the string representation '\u2014'. And confirming this by

CASE WHEN '—' = '\u2014' THEN 1 ELSE 0 END

Applying the SHA2 function to the character with bitLength=512

SHA2('\u2014', 512)

results in

48bf9ce5ab3755bf779032043c2ff6b0b075ec3407b3e473872ea2b059165a86bd70e0854ff6e9d05f1677a77fefcce7c6e3d73d1f3d9dd6488a2fecf041bb89

Pasting the character in online editors, such as https://emn178.github.io/online-tools/sha512.html, https://www.conversion-tool.com/sha512/, https://codebeautify.org/sha512-hash-generator and https://sha512.online/, results consistently in

48bf9ce5ab3755bf779032043c2ff6b0b075ec3407b3e473872ea2b059165a86bd70e0854ff6e9d05f1677a77fefcce7c6e3d73d1f3d9dd6488a2fecf041bb89

The same value as in Azure Databricks.

This makes me believe that the issue is with SQL Server. Are there any suggestions on how to make SQL Server hash this character 'correctly'?

CodePudding user response:

The hash value is based on the binary representation of the string.

The following

SELECT 
    CONVERT(VARCHAR(128),
            HASHBYTES('SHA2_512', CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8))
            , 2)

Returns the hash value you are after (using a UTF8 collation).

You are after

SELECT  HASHBYTES('SHA2_512', 0xE28094)

Some example results (showing method A yields the desired binary result to hash)

SELECT 
   A = CONVERT(VARBINARY(MAX), CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8)),
   B = CONVERT(VARBINARY(MAX), CONVERT(VARCHAR(MAX), NCHAR(8212) COLLATE LATIN1_GENERAL_100_CI_AS)),
   C = CONVERT(VARBINARY(MAX), NCHAR(8212))

Returns

A B C
0xE28094 0x97 0x1420
  • Related