I am new to Snowflake and want to know, can we use hashcodes for joining tables or finding unique records or deleting duplicate records in Snowflake(or in any other database in general)? I am designing an ETL flow, what are the advantages or disadvantages of using hashcodes and why are they generally not used often in most Data warehousing designs?
CodePudding user response:
If you mean hashing with something like md5_binary or sha1_binary then yes absolutely,
Binary values are half the byte length of the equivalent varchar length and so you should use that. The benefit of using hash-keys (effectively) is that you only need a single join column if for instance the natural keys of a table might be a composite key. Now you could instead a numeric/int data type, sequence key but that imposes a load order. Example only after the related dimension tables have loaded should you build the related fact table --- if you are doing that.
Data Vault prefers durable hash-keys because it does not impose any load ordering, load in any order independently.
Anyway I digress, yes hash-keys have great advantages, just make sure they're binary data types when loaded.