Ordered string to integer hash function preserving lexicographic order of its argument

Time:03-18

Let's say we have a collection of byte strings, sorted in lexicographic order as usual. We want to define a hash function that maps a string to an integer in such a way that the ordering of the hash values preserves the ordering of the strings to a sufficient degree. That is, given a string A that is less than or equal to a string B, H(A) should always be less than or equal to H(B).

Clearly, a not-so-good hash function of this sort is possible. For example, we can take a fixed-length prefix of each string (say, 8 bytes) and treat it as a big-endian unsigned 64-bit integer. The resulting integers will sort in the desired order. This approach even works for shorter strings: we can append zero bytes to a short string until it is as long as the prefix (but only if we can assume that 0 is not a valid byte value).
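For concreteness, the prefix idea can be sketched in Python as follows. The function name `prefix_hash` and the `width` parameter are made up for illustration; this is a minimal sketch, assuming zero-padding for short strings as described above.

```python
def prefix_hash(s: bytes, width: int = 8) -> int:
    """Map a byte string to an integer built from its first `width` bytes."""
    # Truncate long strings and right-pad short ones with zero bytes,
    # then read the result as a big-endian unsigned integer.
    padded = s[:width].ljust(width, b"\x00")
    return int.from_bytes(padded, "big")


keys = sorted([b"apple", b"apricot", b"b", b"banana", b"cherry pie, extra long"])
hashes = [prefix_hash(k) for k in keys]
assert hashes == sorted(hashes)  # order is preserved (ties are allowed)
```

With `width=8` the result always fits in an unsigned 64-bit integer.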

Unfortunately, this potential solution, while fast and easy, has major drawbacks. It becomes rather useless when strings tend to share sizeable common prefixes. It also cannot handle strings shorter than the chosen prefix when 0x00 is a meaningful byte and we want shorter strings to sort before longer ones.
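Both drawbacks are easy to reproduce. The snippet below restates the prefix scheme as a one-line helper (the name `prefix_hash` is hypothetical) so that it runs on its own; the example strings are invented.

```python
def prefix_hash(s: bytes, width: int = 8) -> int:
    # First `width` bytes, zero-padded, read as a big-endian unsigned integer.
    return int.from_bytes(s[:width].ljust(width, b"\x00"), "big")


# Drawback 1: strings that share a prefix longer than `width` all collide.
assert prefix_hash(b"https://example.com/a") == prefix_hash(b"https://example.com/b")

# Drawback 2: if 0x00 is a legitimate byte, zero-padding hides the difference
# between a short string and the same string followed by zero bytes.
assert b"ab" < b"ab\x00"
assert prefix_hash(b"ab") == prefix_hash(b"ab\x00")
```

In both cases the order is still not violated (equal hashes satisfy "less than or equal"), but the hash carries no information about which string comes first.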

So the question is: is it possible to do any better? Is there some arithmetic trick (or rather one of the Knuth's "Concrete Mathematics" sort) that can consider all the bytes of the string and yield an appropriately ordered hash value?

CodePudding user response:

The best you can do is to apply an order-preserving arithmetic encoding, based on the best statistical model of the strings that you can come up with, and then take a prefix of that to form the "hash" code.

Each hash code will then be equally likely, according to that statistical model.

If your model is just that all strings are equally likely, then this reduces to your "just take a prefix" idea, so whether or not this will work for you really depends on how much you know about your strings and how good you need this code to be.
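As a rough illustration of the arithmetic-encoding idea, here is a minimal Python sketch. It makes several assumptions that are not part of the answer: the model is a simple order-0 byte-frequency table with add-one smoothing, an end-of-string symbol is placed before all byte values so that shorter strings sort first, and exact `Fraction` arithmetic is used instead of the fixed-precision integer arithmetic a real arithmetic coder would use. The names `build_model` and `ordered_hash` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

EOS = 0  # end-of-string symbol, ordered before every byte; byte b is symbol b + 1


def build_model(samples):
    """Order-0 model: smoothed symbol counts plus cumulative counts."""
    freq = [1] * 257  # add-one smoothing keeps every symbol encodable
    for s in samples:
        freq[EOS] += 1
        for b in s:
            freq[b + 1] += 1
    cum = [0] + list(accumulate(freq))  # cum[i] == sum(freq[:i])
    return freq, cum


def ordered_hash(s: bytes, model, bits: int = 64) -> int:
    """Top `bits` bits of the lower bound of the arithmetic-coding interval."""
    freq, cum = model
    total = cum[-1]
    lo, width = Fraction(0), Fraction(1)
    for sym in [b + 1 for b in s] + [EOS]:
        # Narrow [lo, lo + width) to the sub-interval of `sym`; sub-intervals
        # are laid out in symbol order, which is what preserves the ordering.
        lo += width * cum[sym] / total
        width = width * freq[sym] / total
        if width * (1 << bits) < 1:
            break  # interval is narrower than one hash step; stop early
    return int(lo * (1 << bits))


# Keys sharing an 8-byte prefix, which a plain 8-byte prefix hash cannot separate.
samples = [b"account:%03d" % i for i in range(1000)]
model = build_model(samples)
hashes = [ordered_hash(k, model) for k in samples]
assert hashes == sorted(hashes)         # order preserved
assert len(set(hashes)) == len(hashes)  # and, for this data, no collisions
```

Frequent symbols get wide sub-intervals and therefore consume few bits of the hash, which is why the shared `account:` prefix no longer uses up the whole 64 bits. If all 257 frequencies are equal, the narrowing step simply writes the symbols out as base-257 digits, which is the prefix idea again.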

Also note that many realistic models will allow a simpler encoding scheme. "Just take a prefix" is again an example of this.

Most things that people might think they want to do with a "hash code" like this are not practical -- you will probably end up doing something else. Maybe you want to ask about your real problem, so we can help solve it in some other way.
