What is the safest way to hash a list of non-repeating integers without taking order into account?

Time:12-21

I am looking for a hash function that can hash a list of non-repeating integers while ignoring their order.

Example

I want the two lists

l1 = [0, 1, 3, 7]
l2 = [7, 3, 1, 0]

to have the same hash.

Background

I have an algorithm that finds a list of vertices on a graph. In an undirected graph, the algorithm will find certain lists multiple times in different orders. With my current understanding of the algorithm, it is easier to filter out the duplicates than to reinvent the algorithm. For performance reasons, I expect it to be cheaper to hash the found lists of vertices than to compare the whole lists.

Possible answers

Now, I see that

  • an XOR or a simple sum might be an answer.
    Unfortunately, both offer too much potential for hash collisions, as I see it.
  • The working, but not very efficient, method is to sort each list and then compare the sorted new list against the sorted lists found so far.
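To make the collision concern concrete, here is a minimal C sketch. The function names sum_hash and xor_hash are made up for illustration, and 32-bit vertex indices are an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Order-independent sum of the elements, wrapping modulo 2^32. */
static uint32_t sum_hash(const uint32_t *l, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += l[i];
    return s;
}

/* Order-independent XOR of the elements. */
static uint32_t xor_hash(const uint32_t *l, size_t n) {
    uint32_t x = 0;
    for (size_t i = 0; i < n; i++) x ^= l[i];
    return x;
}
```

With these, the distinct lists [0, 3] and [1, 2] both sum to 3 and both XOR to 3, so neither function alone tells them apart.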

Other Thoughts

Given that

  • The lists contain only integers.
  • The integers will be the vertex indices, and the graph can have billions of vertices.
  • The integers in a list are non-repeating, and their order doesn't matter.
  • The lists can and will consist of between 2 and 100 (and in some cases > 1000) entries.
  • No need for cryptographically-secure randomness.

I have a feeling that there should be a relatively easy and straightforward answer, and I just have not found it.

CodePudding user response:

Use a combination of the product, the sum and the XOR (^). All three are commutative (order independent) with unsigned math.

unsigned long long product = 1;
unsigned sum = 0;  // Maybe unsigned long long
unsigned x = 0;
for (size_t i = 0; i < array_element_count; i++) {
  product *= l[i];  // wraps modulo 2^64
  sum += l[i];
  x ^= l[i];
}
unsigned long long pre_hash = product + sum + ((unsigned long long) x << 32);
unsigned hash = pre_hash % hash_table_size;

Tip: hash_table_size should be a prime to effectively use all pre_hash bits.


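If array_element_count is high, I would consider product *= shift_right_until_odd(l[i]). Otherwise product will too often become 0: every even l[i] adds trailing zero bits, and once 64 factors of two have accumulated, the 64-bit product wraps to 0.

If l[i] == 0, product *= l[i] zeroes the product immediately, so that case deserves something different. A simple mitigation is product *= l[i] | 1, but that is something pulled out of the air.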
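A minimal sketch of that idea in C follows. The answer names shift_right_until_odd without defining it, so the body here is an assumption, as are the odd_product wrapper, the 32-bit element type, and the choice to map 0 to 1.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: strip trailing zero bits so the result is odd.
   Zero has no odd part, so it is mapped to 1 (an arbitrary choice
   made for this sketch). */
static uint64_t shift_right_until_odd(uint64_t v) {
    if (v == 0) return 1;
    while ((v & 1) == 0) v >>= 1;
    return v;
}

/* Order-independent product of the odd parts, wrapping modulo 2^64.
   A product of odd numbers is always odd, so it never collapses to 0. */
static uint64_t odd_product(const uint32_t *l, size_t n) {
    uint64_t product = 1;
    for (size_t i = 0; i < n; i++) {
        product *= shift_right_until_odd(l[i]);
    }
    return product;
}
```

The trade-off is that stripping the factors of two discards information: 0, 1, 2, 4 and 8 all contribute the same factor of 1. That is one reason to combine the product with the sum and the XOR rather than use it alone.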

Good hash design takes time; the above are candidate building blocks for the OP.

CodePudding user response:

I think you will have to invent one to avoid the slow sorting option. In addition to XOR and arithmetic addition, there are bit rotations and bit masks you could use. If you need high collision resistance, you could combine more than one hash function. For example, assuming the d_i and the arithmetic are modular, as with uint32_t:

H_1 = sum_{i = 1 to n} d_i
H_2 = xor_{i = 1 to n} d_i
H_3 = xor_{i = 1 to n} (rotl(d_i, d_i & 0x1f) + c)

Then take the concatenation of H_1, H_2 and H_3 as a 12-byte hash.
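The three formulas could be sketched in C roughly as follows. The names hash96, hash_set and rotl32 are invented for this illustration, and since the answer does not specify the constant c, it is passed in as a parameter.

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate left by r bits; masking keeps both shift counts in 0..31. */
static uint32_t rotl32(uint32_t v, unsigned r) {
    r &= 31u;
    return (v << r) | (v >> ((32u - r) & 31u));
}

/* Three 32-bit words, 12 bytes in total (hypothetical type name). */
typedef struct {
    uint32_t h1, h2, h3;
} hash96;

/* Order-independent hash of n values; c is a caller-chosen constant
   (an assumption, the answer leaves it open). */
static hash96 hash_set(const uint32_t *d, size_t n, uint32_t c) {
    hash96 h = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        h.h1 += d[i];                           /* H_1: modular sum */
        h.h2 ^= d[i];                           /* H_2: XOR */
        h.h3 ^= rotl32(d[i], d[i] & 0x1f) + c;  /* H_3: rotated XOR */
    }
    return h;
}
```

Two lists are then duplicate candidates only if all three words match. Because a hash can still collide, a full comparison of the matching candidates is needed if exact deduplication matters.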
