Home > Blockchain >  What is the purpose of this code in IdentityHashMap.hash()?
What is the purpose of this code in IdentityHashMap.hash()?

Time:02-26

/**
 * Returns index for Object x.
 */
private static int hash(Object x, int length) {
    int h = System.identityHashCode(x);
    // Multiply by -127, and left-shift to use least bit as part of hash
    return ((h << 1) - (h << 8)) & (length - 1);
}

From: jdk/IdentityHashMap.java at jdk8-b120 · openjdk/jdk · GitHub

In theory, the hash values returned by System.identityHashCode() are already uniformly distributed, so why is there an additional shift operation instead of a direct AND operation with length - 1?

The implementation seems to guarantee that the lowest bit is 0 to ensure that the result of the calculation is an even number, because the implementation requires all keys to be on even indices and all values to be on odd indices.

CodePudding user response:

The comments in the code say:

"Implementation note: This is a simple linear-probe hash table, as described for example in texts by Sedgewick and Knuth. The array alternates holding keys and values."

In fact the the hash method returns a value that is used as a direct index into the array. For example:

public V get(Object key) {
    Object k = maskNull(key);
    Object[] tab = table;
    int len = tab.length;
    int i = hash(k, len);
    while (true) {
        Object item = tab[i];
        if (item == k)
            return (V) tab[i   1];
        if (item == null)
            return null;
        i = nextKeyIndex(i, len);
    }
}

That means that hash needs to return an even value. The calculation in hash is ensuring that the index is even without throwing away the bottom bit of the System.identityHashCode(x) value.

Why not just throw away the bottom bit?

Well, the answer is in the way that System.identityHashCode is implemented. In reality, there are multiple algorithms for generating the hash, and the algorithm used (at runtime) depends on an obscure JVM command line option.

  • Some algorithms are (notionally) evenly distributed across the range of int. For those, discarding the bottom bit would be fine.

  • Other algorithm are not like this. One of algorithm uses a simple global counter. Another uses the object's memory address with the bottom 3 bits removed. If these algorithms are selected, discarding the LSB would increase the probability of hash collisions in IdentityHashMap.

See https://shipilev.net/jvm/anatomy-quarks/26-identity-hash-code/ for more information on IdentityHashcode algorithms and how they are selected. Note that this aspect of JVM behavior is unspecified and liable to be version specific.

CodePudding user response:

My hunch of what’s going on here is that it’s designed to address two issues.

First, the slot index this function produces must be an even number. (The implementation stores keys at even table slots and values at odd table slots.) This means that whatever index is returned must have its last bit equal to zero.

Second, the identity hash codes used are (potentially) based on memory addresses, and low bits of memory addresses are “more random” than high bits. For example, if we allocate a list of objects and the allocator places them all consecutively in memory, their addresses will all have the same high bits but different low bits. (Or perhaps there’s just a global counter of objects that’s incremented when an object is created. In that case, the low bits of object hashes will similarly have a wider dispersion than the high bits.)

To make sure things are spread out in the table, we’d therefore like to “mix” the low bits of the hash code with the “high” bits of the hash code. The effect of subtracting out h << 8 is to shift the low bits of the identity hash code higher up, flip them, and add them back to the hash code, causing a bunch of “ripples” as the addition plays out. I think (?) this is an effective way to then inject higher entropy low bits into the high bits, giving a more uniform hash over the array of slots once the table starts getting larger and larger.

  • Related