I'm trying to calculate a hash from a string for best-effort ordering and partioning purposes in Athena. There is no String to hashCode() similar in Athena, so as a best effort, I try to get the 2nd character and calculate its codepoint and get the modulus. (As I said, best effort, maybe a nice effort)
Consider the query:
SELECT
doc_id,
substring(doc_id, 2, 1),
typeof(split(substring(doc_id, 2, 1)))
FROM events LIMIT 100
The 3rd row returns a varchar
but the codepoint function expects a varchar(1)
and casting it does not work as cast(substring(doc_id, 2, 1) as varchar(1))
.
FUNCTION_NOT_FOUND: line 6:5: Unexpected parameters (varchar) for function codepoint. Expected: codepoint(varchar(1))
How can I accomplish this task without modifiying the data source? I'm open to ideas.
CodePudding user response:
You can compute a hash code with the xxhash64
function. It takes a varbinary
as input, so first cast the string to that type. Since the function also returns a 64-bit varbinary
value, you can convert it to a bigint
via the from_big_endian_64
function
WITH t(x) AS (VALUES 'hello')
SELECT from_big_endian_64(xxhash64(cast(x AS varbinary)))
FROM t
output:
_col0
---------------------
2794345569481354659
(1 row)