I have some log data like:
logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'}
]
Each dict has the same keys: 'id', 'error' and 'fruit' (in this example).
I want to remove duplicates from this list, but straightforward dict- and set-based approaches do not work because my elements are themselves dicts, which are not hashable:
>>> set(logs)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Another approach is to sort and use itertools.groupby - but dicts are also not comparable, so this also does not work:
>>> from itertools import groupby
>>> [k for k, _ in groupby(sorted(logs))]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'dict' and 'dict'
I had the idea to calculate a hash value for each log entry, and store it in a set for comparison, like so:
def compute_hash(log_dict: dict):
    return hash(log_dict.values())

def deduplicate(logs):
    already_seen = set()
    for log in logs:
        log_hash = compute_hash(log)
        if log_hash in already_seen:
            continue
        already_seen.add(log_hash)
        yield log
However, I found that compute_hash would give the same hash for different dictionaries, even ones with completely bogus contents:
>>> logs = [{'id': '123', 'error': None, 'fruit': 'orange'}, {}]
>>> # The empty dict will be removed; every dict seems to get the same hash.
>>> list(deduplicate(logs))
[{'id': '123', 'error': None, 'fruit': 'orange'}]
After some experimentation, I was seemingly able to fix the problem by modifying compute_hash like so:
def compute_hash(log_dict: dict):
    return hash(frozenset(log_dict.values()))
However, I cannot understand why this makes a difference. Why did the original version seem to give the same hash for every input dict? Why does converting the .values() result to a frozenset first fix the problem?
Aside from that: is this algorithm correct? Or is there some counterexample where the wrong values will be removed?
CodePudding user response:
What went wrong
The first thing I want to point out about the original attempt is that it seems over-engineered. When the inputs are hashable, manually iterating is only necessary to preserve order, and even then, in 3.7 and up we can rely on the order-preserving property of dicts.
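For example, with hashable elements, order-preserving deduplication needs no explicit loop (a quick illustration, assuming 3.7 and up):
>>> list(dict.fromkeys(['orange', 'apple', 'orange']))
['orange', 'apple']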
Just because it's hashable doesn't mean the hash is useful
It also isn't especially useful to call hash on log_dict.values(). While log_dict is not hashable, its .values() (in 3.x) is an instance of the dict_values type (the name is not defined in the builtins, but that is how instances identify themselves), which is hashable:
>>> dv = {1:2, 3:4}.values()
>>> dv
dict_values([2, 4])
>>> {dv}
{dict_values([2, 4])}
So we could just as easily have used the .values() directly as a "hash":
def compute_hash(log_dict: dict):
    return log_dict.values()
... but this would have given a new bug - now every hash would be different:
>>> {1:2}.values() == {1:2}.values()
False
But why?
Because the dict_values type defines neither __hash__ nor __eq__. object is the immediate superclass, so calls to those methods fall back to the object defaults:
>>> dv.__class__.__bases__
(<class 'object'>,)
>>> dv.__class__.__hash__
<slot wrapper '__hash__' of 'object' objects>
>>> dv.__class__.__eq__
<slot wrapper '__eq__' of 'object' objects>
In fact, dict_values cannot sensibly implement these methods because it is (indirectly) mutable - as a view, it is dependent on the underlying dict:
>>> d = {1:2}
>>> dv = d.values()
>>> d[3] = 4
>>> dv
dict_values([2, 4])
Since there is no obvious, generic way to hash an arbitrary object based on its attributes that isn't also exceedingly slow, the default simply ignores the attributes and is based on object identity. For example, on my platform, the results look like:
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dv = {1:2, 3:4}.values()
>>> bin(id(dv))
'0b11111110101110011010010110000001010101011110000'
>>> bin(hash(dv))
'0b1111111010111001101001011000000101010101111'
In other words:
>>> hash(dv) == id(dv) // 16
True
Thus, if compute_hash in the original code is repeatedly called with temporary objects, it won't give useful results - the results don't depend on the contents of the object, and will commonly be the same, as temporary (i.e., immediately GC'd) objects in a loop will often end up in the same memory location.
(Yes, this means that objects default to being hashable and equality-comparable. The dict type itself overrides __hash__ to explicitly disallow it, while - curiously - overriding __eq__ to compare contents.)
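For example, two distinct dicts compare equal by contents, but refuse to be hashed:
>>> {'a': 1} == {'a': 1}
True
>>> hash({'a': 1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'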
frozenset has a useful hash
On the other hand, frozenset is intended for long-term storage of some immutable data. Consequently, it's important and useful for it to define a __hash__, and it does:
>>> f = frozenset(dv)
>>> bin(id(f))
'0b11111110101110011010001011101000110001011100000'
>>> bin(hash(f))
'0b101111010001101001001111100001000001100111011101101100000110001'
Dictionaries, hashing and collision detection
Although there have been many tweaks and optimizations over the years, Python's dict and set types are both fundamentally based on hash tables. When a value is inserted, its hash is first computed (normally an integer value), and then that value is reduced (normally using modulo) into an index into the underlying table storage. Similarly, when a value is looked up, the hash is computed and reduced in order to determine where to look in the table for that value.
Of course, it is possible that some other value is already stored in that spot. There are multiple possible strategies for dealing with this (and last I checked, the literature is inconsistent about naming them). But most importantly for our purposes: when looking up a value in a dict by key, or checking for the presence of a value in a set, the container will also have to do equality checks after figuring out where to look, in order to confirm that the right thing has actually been found.
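As a toy sketch of that lookup logic (deliberately simplified; it ignores collision handling and is not how CPython actually implements it):
def toy_lookup(table, key):
    # reduce the (arbitrary-sized) hash to an index into the table
    index = hash(key) % len(table)
    entry = table[index]
    # the equality check confirms we found the right value, not just
    # something else that happened to land in the same slot
    return entry is not None and entry == key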
Consequently, any approach that simply computes a hash manually, and naively associates those hashes with the original values, will fail. It is easy for two of the input dicts to have the same computed hash value, even when the hash genuinely takes their contents into account. For example, the hash of a frozenset is based on an XOR of hashes for the elements. So if two of our input dicts had all the same values assigned to keys in a different order, the hash would be the same:
>>> def show_hash(d):
... return bin(hash(frozenset(d.values())))
...
>>> show_hash({'id': '1', 'error': None, 'value': 'apple'})
'0b101010010100001000111001000001000111101111110100010000010101110'
>>> # Changing a value changes the hash...
>>> show_hash({'id': '1', 'error': None, 'value': 'orange'})
'0b11111111001000011101011001001011100010100100010010110000100100'
>>> # but rearranging them does not:
>>> show_hash({'id': '1', 'error': 'orange', 'value': None})
'0b11111111001000011101011001001011100010100100010010110000100100'
It's also possible for such a hash collision to occur by coincidence with totally unrelated values, although that is extremely unlikely for 64-bit hashes (since this value will not be reduced and used as a hash table index, despite the name).
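As a curiosity, collisions can even happen between small built-in values; in CPython, for example:
>>> hash(-1) == hash(-2)
True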
Fixing it explicitly
So, in order to have correct code, we would need to do our own checking afterwards, explicitly checking whether the value which hashed to something in our already_seen set was actually equal to previous values that had that hash. And there could theoretically be multiple of those, so we'd have to remember multiple values for each of those external hashes, perhaps by using a dict for already_seen instead. Something like:
from collections import defaultdict

def deduplicate(logs):
    already_seen = defaultdict(list)
    for log in logs:
        log_hash = compute_hash(log)
        if log in already_seen.get(log_hash, ()):
            continue
        already_seen[log_hash].append(log)
        yield log
Hopefully this immediately looks unsatisfactory. With this approach, we are essentially re-implementing the core logic of sets and dictionaries - we compute hashes ourselves, retrieve corresponding values from internal storage (already_seen) and then manually check for equality (if log in ...).
Looking at it from another angle
The reason we're doing all of this in the first place - looking for a hash value to represent the original dict in our own storage - is because the dict isn't hashable. What if we simply addressed that problem head-on, instead?
That is to say: rather than simply computing a value from our original data that happens to be hashable, let's explicitly convert the data into a hashable form, preserving all the information. In other words, use a different type to represent the data, rather than a dict.
Since all our input dicts have the same keys, the natural thing to do would be to convert those into the attributes of a user-defined class. In 3.7 and up, a simple, natural and explicit way to do this is using a dataclass, like so:
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True, slots=True)
class LogEntry:
    id: str
    error: Optional[str]
    fruit: str
It's not explained very well in the documentation, but using frozen=True (the main purpose is to make the instances immutable) will cause a __hash__ to be generated as well, taking the fields into account as desired. Using slots=True (available in 3.10 and up; omit it on older versions) causes __slots__ to be generated for the type as well, avoiding memory overhead.
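For example, instances with equal field values now compare equal and hash consistently:
>>> e1 = LogEntry(id='1', error=None, fruit='orange')
>>> e2 = LogEntry(id='1', error=None, fruit='orange')
>>> e1 == e2
True
>>> len({e1, e2})
1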
From here, it's trivial to convert the existing logs:
logs = [LogEntry(**d) for d in logs]
And we can directly deduplicate with a set:
set(logs)
or, preserving order, using a dict (in 3.7 and up):
list(dict.fromkeys(logs))
There are other options, of course. The simplest is to make a tuple from the .values() - assuming each log dict has its keys in the same order (again, assuming Python 3.7 and up, where keys have an order), this preserves all the useful information - the .keys() are just for convenience. Slightly more sophisticated, we could use collections.namedtuple:
from collections import namedtuple
LogEntry = namedtuple('LogEntry', 'id error fruit')
# from here, use the LogEntry type as before
This is simpler than the dataclass approach, but less explicit (and doesn't offer an elegant way to document field types).
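A quick usage sketch of the namedtuple variant (the duplicate entry is added here purely for illustration):
from collections import namedtuple

LogEntry = namedtuple('LogEntry', 'id error fruit')

logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'},
]

# namedtuples hash and compare by contents, so dict.fromkeys deduplicates
# while preserving first-seen order (3.7 and up)
unique = list(dict.fromkeys(LogEntry(**d) for d in logs))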
CodePudding user response:
You have some working answers, but I think you may be overcomplicating things. Here's the quick fix I would have done on your original code.
logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'},
]

def get_values(log: dict):
    return tuple(log.values())

unique_logs = set(map(get_values, logs))
for log in unique_logs:
    print(log)
('12345', None, 'apple')
('1234', None, 'orange')
CodePudding user response:
You can't base "already seen" on just a hash. Hashes are smaller than the actual data but can have collisions; that is the tradeoff. Use hashes to group logs and then check for equality - there will still be collisions.
Dicts already group keys by hash and check equality for you - no need to reinvent this. Your logs are dictionaries, which are unhashable because they are mutable. One simple way could be to convert your dicts to strings using json.dumps(). Or, for more efficient storage, find something similar to frozenset but for dicts.
import json

def deduplicate(logs):
    already_seen = set()
    for log in logs:
        log_hash = json.dumps(log, sort_keys=True)
        if log_hash in already_seen:
            continue
        already_seen.add(log_hash)
        yield log
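Note that this assumes the values are JSON-serializable (strings, numbers, booleans, None, and nested lists/dicts of those), which is true for the example logs.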