Home > OS >  Calculating hash integer from a string in Athena
Calculating hash integer from a string in Athena

Time:12-30

I'm trying to calculate a hash from a string for best-effort ordering and partioning purposes in Athena. There is no String to hashCode() similar in Athena, so as a best effort, I try to get the 2nd character and calculate its codepoint and get the modulus. (As I said, best effort, maybe a nice effort)

Consider the query:

SELECT
    doc_id,
    substring(doc_id, 2, 1),
    typeof(split(substring(doc_id, 2, 1)))
FROM events LIMIT 100

The 3rd row returns a varchar but the codepoint function expects a varchar(1) and casting it does not work as cast(substring(doc_id, 2, 1) as varchar(1)).

FUNCTION_NOT_FOUND: line 6:5: Unexpected parameters (varchar) for function codepoint. Expected: codepoint(varchar(1))

How can I accomplish this task without modifiying the data source? I'm open to ideas.

CodePudding user response:

You can compute a hash code with the xxhash64 function. It takes a varbinary as input, so first cast the string to that type. Since the function also returns a 64-bit varbinary value, you can convert it to a bigint via the from_big_endian_64 function

WITH t(x) AS (VALUES 'hello')
SELECT from_big_endian_64(xxhash64(cast(x AS varbinary)))
FROM t

output:

        _col0
---------------------
 2794345569481354659
(1 row)
  • Related