Home > Blockchain >  Hashing and encoding sequence
Hashing and encoding sequence

Time:12-08

I am studying the high-level design of generating a TinyURL like service, from here.

The author makes a statement:

"To generate a unique short URL, we can compute it using the Unique Hash (MD5, SHA256, etc.) of the original URL and then encode using base62."

I understand what hashing and encoding mean in general terms, but I do not understand the sequence - hashing first followed by encoding. What is the reasoning behind following this order? Does it always remain the same? Why not the other way round - encoding first and then hashing?

Thanks!

Edit: I would like to clarify that all online resources (that I could lay my hands on) follow the same sequence.

CodePudding user response:

Hashing vs encoding are accomplishing two different things.

Hashing will take an arbitrary (potentially long) string, and emit a fixed-size (generally short) pile of bytes. The mapping is typically not reversible and map appear random, but the same input will always get mapped to the same output. We normally want to avoid situations where many common inputs are mapped to the same output - like for example, if I map every input string to "hello" then that's technically a hash function but it's also completely useless. That's called wanting to avoid "hash collisions", and popular hash functions (including MD5 and SHA256) do a good job of this.

The encoding layer is much simpler. Base62 encoding just means taking a pile byte data and rewriting it using alphanumeric characters (A-Z, a-z, and 0-9). The output size will be approximately a constant times the size of the input, and the process is completely reversible. This is useful if you want to make some arbitrary data into a valid URL, since otherwise many bytes are not printable or not legal in URLs.

If you hash and then encode, you'll go: (initial URL) --> (short representation that's not printable) --> (short representation that uses URL-legal characters). Useful!

If you encode first and then hash, you'll go: (initial URL) --> (similar-length representation that still uses URL-legal characters) --> (short representation that's not printable). This time we didn't end up where we wanted (end result is not printable) and also the first step was just kinda pointless.

  • Related