How can I properly hash dictionaries with a common set of keys, for deduplication purposes?


I have some log data like:

logs = [
 {'id': '1234', 'error': None, 'fruit': 'orange'},
 {'id': '12345', 'error': None, 'fruit': 'apple'}
]

Each dict has the same keys: 'id', 'error' and 'fruit' (in this example).

I want to remove duplicates from this list, but straightforward dict and set based approaches do not work because my elements are themselves dicts, which are not hashable:

>>> set(logs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

Another approach is to sort the list and use itertools.groupby - but dicts are not orderable either, so this does not work:

>>> from itertools import groupby
>>> [k for k, _ in groupby(sorted(logs))]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'dict' and 'dict'

I had the idea to calculate a hash value for each log entry, and store it in a set for comparison, like so:

def compute_hash(log_dict: dict):
    return hash(log_dict.values())

def deduplicate(logs):
    already_seen = set()
    for log in logs:
        log_hash = compute_hash(log)
        if log_hash in already_seen:
            continue
        already_seen.add(log_hash)
        yield log

However, I found that compute_hash would give the same hash for different dictionaries, even ones with completely bogus contents:

>>> logs = [{'id': '123', 'error': None, 'fruit': 'orange'}, {}]
>>> # The empty dict will be removed; every dict seems to get the same hash.
>>> list(deduplicate(logs))
[{'id': '123', 'error': None, 'fruit': 'orange'}]

After some experimentation, I was seemingly able to fix the problem by modifying compute_hash like so:

def compute_hash(log_dict: dict):
    return hash(frozenset(log_dict.values()))

However, I cannot understand why this makes a difference. Why did the original version seem to give the same hash for every input dict? Why does converting the .values result to a frozenset first fix the problem? Aside from that: is this algorithm correct? Or is there some counterexample where the wrong values will be removed?

CodePudding user response:

What went wrong

The first thing I want to point out about the original attempt is that it seems over-engineered. When the inputs are hashable, manually iterating is only necessary to preserve order, and even then, in 3.7 and up we can rely on the order-preserving property of dicts.
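To make the point concrete, here is a minimal sketch (with a hypothetical list of already-hashable items) of what deduplication looks like when hashing just works:

# For hashable inputs, the manual loop collapses to one of these:
items = ['orange', 'apple', 'orange']
deduped_unordered = set(items)                 # order not preserved
deduped_in_order = list(dict.fromkeys(items))  # first-seen order, 3.7+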

Just because it's hashable doesn't mean the hash is useful

It also isn't especially useful to call hash on log_dict.values(). While log_dict is not hashable, its .values() (in 3.x) is an instance of the dict_values type (the name is not defined in the builtins, but that is how instances identify themselves), which is hashable:

>>> dv = {1:2, 3:4}.values()
>>> dv
dict_values([2, 4])
>>> {dv}
{dict_values([2, 4])}

So we could just as easily have used the .values() directly as a "hash":

def compute_hash(log_dict: dict):
    return log_dict.values()

... but this would have given a new bug - now every hash would be different:

>>> {1:2}.values() == {1:2}.values()
False

But why?

Because the dict_values type defines neither __hash__ nor __eq__. object is its immediate superclass, so calls to those methods fall back to the object defaults:

>>> dv.__class__.__bases__
(<class 'object'>,)
>>> dv.__class__.__hash__
<slot wrapper '__hash__' of 'object' objects>
>>> dv.__class__.__eq__
<slot wrapper '__eq__' of 'object' objects>

In fact, dict_values cannot sensibly implement these methods because it is (indirectly) mutable - as a view, it is dependent on the underlying dict:

>>> d = {1:2}
>>> dv = d.values()
>>> d[3] = 4
>>> dv
dict_values([2, 4])

Since there is no obvious generic way to hash an arbitrary object based on its attributes that isn't also exceedingly slow, the default ignores the attributes entirely and is based on object identity instead. For example, on my platform, the results look like:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dv = {1:2, 3:4}.values()
>>> bin(id(dv))
'0b11111110101110011010010110000001010101011110000'
>>> bin(hash(dv))
'0b1111111010111001101001011000000101010101111'

In other words:

>>> hash(dv) == id(dv) // 16
True

Thus, if compute_hash in the original code is repeatedly called with temporary objects, it won't give useful results - the hash doesn't depend on the contents of the object, and will commonly be the same, because temporary (i.e., immediately garbage-collected) objects in a loop often end up at the same memory location.
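A quick way to see this in practice (a sketch; the exact count depends on the interpreter and allocator, but on CPython heavy reuse is typical):

# Each dict below is temporary, so its .values() view tends to be allocated
# at the same address each time - and the identity-based hash repeats with it.
distinct_hashes = {hash({'id': str(i)}.values()) for i in range(1000)}
print(len(distinct_hashes))  # commonly a very small number, often just 1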

(Yes, this means that objects default to being hashable and equality-comparable. The dict type itself overrides __hash__ to explicitly disallow it, while - curiously - overriding __eq__ to compare contents.)
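Both halves of that claim are easy to verify:

>>> dict.__hash__ is None
True
>>> {1: 2} == {1: 2}
True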

frozenset has a useful hash

On the other hand, frozenset is intended for long-term storage of some immutable data. Consequently, it's important and useful for it to define a __hash__, and it does:

>>> f = frozenset(dv)
>>> bin(id(f))
'0b11111110101110011010001011101000110001011100000'
>>> bin(hash(f))
'0b101111010001101001001111100001000001100111011101101100000110001'

Dictionaries, hashing and collision detection

Although there have been many tweaks and optimizations over the years, Python's dict and set types are both fundamentally based on hash tables. When a value is inserted, its hash is first computed (normally an integer value), and then that value is reduced (normally using modulo) into an index into the underlying table storage. Similarly, when a value is looked up, the hash is computed and reduced in order to determine where to look in the table for that value.
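As a rough sketch of that reduction step (real dict internals add open addressing, perturbation and resizing on top of this idea):

# Toy illustration only: squeeze a (possibly huge) hash into a table index.
table_size = 8
key = 'orange'
index = hash(key) % table_size  # where a real hash table would start looking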

Of course, it is possible that some other value is already stored in that spot. There are multiple possible strategies for dealing with this (and last I checked, the literature is inconsistent about naming them). But most importantly for our purposes: when looking up a value in a dict by key, or checking for the presence of a value in a set, the container will also have to do equality checks after figuring out where to look, in order to confirm that the right thing has actually been found.
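Put differently, a bad hash hurts performance but not correctness, because the container always confirms a match with __eq__. A small sketch to illustrate:

# Deliberately terrible hash: every instance collides, yet the set stays correct.
class Collider:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return isinstance(other, Collider) and self.value == other.value

print(len({Collider(1), Collider(2), Collider(1)}))  # 2 - the duplicate is dropped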

Consequently, any approach that simply computes a hash manually, and naively associates those hashes with the original values, will fail. It is easy for two of the input dicts to end up with the same computed hash value, even when that hash does take their contents into account. For example, the hash of a frozenset is based on an XOR of hashes for the elements; so if two of our input dicts had all the same values assigned to keys in a different order, the hash would be the same:

>>> def show_hash(d):
...     return bin(hash(frozenset(d.values())))
... 
>>> show_hash({'id': '1', 'error': None, 'value': 'apple'})
'0b101010010100001000111001000001000111101111110100010000010101110'
>>> # Changing a value changes the hash...
>>> show_hash({'id': '1', 'error': None, 'value': 'orange'})
'0b11111111001000011101011001001011100010100100010010110000100100'
>>> # but rearranging them does not:
>>> show_hash({'id': '1', 'error': 'orange', 'value': None})
'0b11111111001000011101011001001011100010100100010010110000100100'

It's also possible for such a hash collision to occur by coincidence with totally unrelated values - though that's extremely unlikely with 64-bit hashes, since this value will not be reduced and used as a hash table index, despite the name.

Fixing it explicitly

So, in order to have correct code, we would need to do our own checking afterwards: when a value hashes to something already in our already_seen set, explicitly verify that it is equal to a previously seen value with that hash. And there could theoretically be multiple of those, so we'd have to remember multiple values for each of those external hashes, perhaps by using a dict for already_seen instead. Something like:

from collections import defaultdict

def deduplicate(logs):
    # map each hash to every distinct log already seen with that hash
    already_seen = defaultdict(list)
    for log in logs:
        log_hash = compute_hash(log)
        # only a real, equality-checked match counts as a duplicate
        if log in already_seen.get(log_hash, ()):
            continue
        already_seen[log_hash].append(log)
        yield log

Hopefully this immediately looks unsatisfactory. With this approach, we are essentially re-implementing the core logic of sets and dictionaries - we compute hashes ourselves, retrieve corresponding values from internal storage (already_seen) and then manually check for equality (if log in ...).

Looking at it from another angle

The reason we're doing all of this in the first place - looking for a hash value to represent the original dict in our own storage - is because the dict isn't hashable. What if we simply addressed that problem head-on, instead?

That is to say: rather than simply computing a value from our original data that happens to be hashable, let's explicitly convert the data into a hashable form, preserving all the information. In other words, use a different type to represent the data, rather than a dict.

Since all our input dicts have the same keys, the natural thing to do would be to convert those into the attributes of a user-defined class. In 3.7 and up, a simple, natural and explicit way to do this is using a dataclass, like so:

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True, slots=True)  # slots=True needs Python 3.10+; omit it on older versions
class LogEntry:
    id: str
    error: Optional[str]
    fruit: str

It's not explained very well in the documentation, but using frozen=True (the main purpose is to make the instances immutable) also causes a __hash__ to be generated, taking the fields into account as desired. Using slots=True (added in Python 3.10; drop it on older versions) causes __slots__ to be generated for the type as well, avoiding memory overhead.
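A quick check that equal field values now mean equal, consistently hashed objects:

>>> LogEntry(id='1', error=None, fruit='orange') == LogEntry(id='1', error=None, fruit='orange')
True
>>> len({LogEntry(id='1', error=None, fruit='orange'), LogEntry(id='1', error=None, fruit='orange')})
1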

From here, it's trivial to convert the existing logs:

logs = [LogEntry(**d) for d in logs]

And we can directly deduplicate with a set:

set(logs)

or, preserving order using a dict (in 3.7 and up):

list(dict.fromkeys(logs))

There are other options, of course. The simplest is to make a tuple from the .values - as long as each log dict was built with its keys in the same order (and Python 3.7 and up guarantees that order is preserved), this keeps all the useful information, since the .keys are just for convenience; a sketch of this appears after the namedtuple example below. Slightly more sophisticated, we could use collections.namedtuple:

from collections import namedtuple

LogEntry = namedtuple('LogEntry', 'id error fruit')
# from here, use the LogEntry type as before

This is simpler than the dataclass approach, but less explicit (and doesn't offer an elegant way to document field types).
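For completeness, here is a minimal sketch of the plain tuple-of-values idea mentioned above, assuming every log dict was built with its keys in the same order:

def deduplicate(logs):
    seen = set()
    for log in logs:
        key = tuple(log.values())  # hashable stand-in for the dict
        if key not in seen:
            seen.add(key)
            yield log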

CodePudding user response:

You have some working answers, but I think you may be overcomplicating things. Here's the quick fix I would have done on your original code.

logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'}, 
]

def get_values(log: dict):
    return tuple(log.values())

unique_logs = set(map(get_values, logs))
for log in unique_logs:
    print(log)

('12345', None, 'apple')
('1234', None, 'orange')
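Note that this leaves you with tuples of values rather than the original dicts. If you want the dicts back (and want to keep the input order), one sketch is to key a dict by those tuples instead:

# One dict per distinct value-tuple, in first-seen order (3.7+).
unique_logs = list({get_values(log): log for log in logs}.values())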

CodePudding user response:

  1. You can't base "already seen" on just a hash. Hashes are smaller than the actual data, so they can collide - that is the tradeoff. Use hashes to group logs, then check for equality within each group.

  2. Even with a content-based hash like the frozenset approach, collisions are still possible.

  3. Dicts already group keys by hash and check equality for you - no need to reinvent this. Your logs are dictionaries, which are unhashable because they are mutable. One simple way is to convert each dict to a string with json.dumps(log, sort_keys=True). Or, for more efficient storage, use something similar to frozenset but for dicts (a sketch follows the code below).

import json

already_seen = set()
deduplicated = []
for log in logs:
    log_hash = json.dumps(log, sort_keys=True)  # canonical string form of the dict
    if log_hash in already_seen:
        continue
    already_seen.add(log_hash)
    deduplicated.append(log)
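One concrete candidate for that "frozenset but for dicts" idea is to hash the items rather than the values - this keeps the keys in play and is insensitive to key order, as long as all the values are themselves hashable (they are in this example):

def freeze(log: dict):
    # frozenset of (key, value) pairs: hashable and order-insensitive
    return frozenset(log.items())

# drop-in replacement for the json.dumps line above:
# log_hash = freeze(log)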