Context:
I have a set of logs with the same keys but different values. The keys are guaranteed to be str, and the values can be either a str or None.
For example:
logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'},
]
Sometimes these logs are duplicated: two dictionaries with exactly the same keys and values.
{'id': '1234', 'error': None, 'fruit': 'orange'}
{'id': '1234', 'error': None, 'fruit': 'orange'}
I am processing them as follows:
already_seen = set()
for log in logs:
    log_hash = compute_hash(log)
    if log_hash in already_seen:
        continue
    already_seen.add(log_hash)
Initially, I used the following hash function:
def compute_hash(log_dict: dict):
    return hash(log_dict.values())
But this gave the same hash for different dictionaries. So, I changed the function to:
def compute_hash(log_dict: dict):
    return hash(frozenset(log_dict.values()))
Now it is working correctly (as far as I have noticed).
Questions:
- Why did the initial approach not work?
- Is the current approach correct? Here, "correct" means:
  - the same hash value for dicts with the same values, and
  - different hash values for dicts with different values.
- Are there any better or simpler approaches to the problem?
Example:
dict_1 = {
    "date": "2022-12-30 18:04:35.488",
    "tenant": "hz",
    "action": "REFRESH_KUBE_WORKLOAD",
    "code": "SERVER_SENT_REQ_TO_AGENT",
    "message_id": "ae62aa57-b155-452f-b7bd-740c18110aa7",
    "error": None,
}
dict_2 = {
    "date": "2022-12-30 18:04:35.479",
    "tenant": "hz",
    "action": "REFRESH_KUBE_WORKLOAD",
    "code": "SERVER_FAIL",
    "message_id": "5b0eb8d9-9c6a-4c6d-bd27-b03ac0b865e8",
    "error": " Timeout",
}
print(hash(dict_1.values()) == hash(dict_2.values()))
CodePudding user response:
In Python 3.x, .values on a dict returns an instance of dict_values (the name is not defined in the builtins, but that is how instances identify themselves):
>>> dv = {1:2, 3:4}.values()
>>> dv
dict_values([2, 4])
This class is intended as a view on an existing underlying object, and does not define its own __hash__ method. object is the immediate superclass, so calls to __hash__ fall back to object.__hash__:
>>> dv.__class__.__bases__
(<class 'object'>,)
>>> dv.__class__.__hash__
<slot wrapper '__hash__' of 'object' objects>
In fact, dict_values cannot sensibly implement __hash__ because it is (indirectly) mutable - as a view, it is dependent on the underlying dict:
>>> d = {1:2}
>>> dv = d.values()
>>> d[3] = 4
>>> dv
dict_values([2, 4])
Since there is no obvious generic way to hash an arbitrary object by its attributes that isn't also exceedingly slow, the default ignores attributes entirely and is based purely on object identity. For example, on my platform, the results look like:
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dv = {1:2, 3:4}.values()
>>> bin(id(dv))
'0b11111110101110011010010110000001010101011110000'
>>> bin(hash(dv))
'0b1111111010111001101001011000000101010101111'
In other words:
>>> hash(dv) == id(dv) // 16
True
Thus, if compute_hash in the original code is repeatedly called with temporary objects, it won't give useful results - the results don't depend on the contents of the object, and will commonly be the same, as temporary (i.e., immediately GCd) objects in a loop will often end up in the same memory location.
On the other hand, frozenset is intended for long-term storage of some immutable data. Consequently, it's important and useful for it to define a __hash__, and it does:
>>> f = frozenset(dv)
>>> bin(id(f))
'0b11111110101110011010001011101000110001011100000'
>>> bin(hash(f))
'0b101111010001101001001111100001000001100111011101101100000110001'
We can see the reference implementation for frozenset.__hash__ on GitHub. The value is cached after the first computation (after all, it's an immutable object, so there is no reason to repeat the calculation except to save memory), and otherwise it iterates over the hash table, XORs together sub-computations from the entries, and does some additional fixup. So we can be confident that this will be a correct hash value for the task in question.
It's hard to imagine solving the problem any more simply, since the only change to the non-working code is a single type coercion. If the log entries all have the same set of keys, however, it might make more sense to hard-code that into the hashing:
def hash_log_entry(entry):
    return hash(entry['id']) ^ hash(entry['error']) ^ hash(entry['fruit'])
The existing code, however, is not technically correct; after all, a hash value contains only a limited number of bits, whereas we could be hashing arbitrarily large input sets. As such, a completely different set could be falsely registered as already_seen (although it is extremely unlikely). More to the point, if our hash only cares about the .values of the log, disregarding order, then of course it would give the same hash value to another log with the same set of values assigned to different keys - e.g. {'id': '1234', 'error': None, 'fruit': 'orange'} vs. {'id': '1234', 'error': 'orange', 'fruit': None}.
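A minimal demonstration of that key-swap collision, and of one common fix - hashing the (key, value) pairs instead of just the values (this assumes, as here, that the values themselves are hashable):
log_a = {'id': '1234', 'error': None, 'fruit': 'orange'}
log_b = {'id': '1234', 'error': 'orange', 'fruit': None}
# Same *set* of values, so the values-only hash collides:
print(hash(frozenset(log_a.values())) == hash(frozenset(log_b.values())))  # True
# Hashing the items instead keeps the key-value association:
print(frozenset(log_a.items()) == frozenset(log_b.items()))  # False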
To work around that, one approach is to use a dict instead, so as to keep track of the original values, and bin them according to matching hashes (since we now actually need to allow for collisions):
from collections import defaultdict

# We can't use a set to remember dicts with the same custom hash -
# that's the problem we were trying to work around in the first place!
already_seen = defaultdict(list)

for log in logs:
    log_hash = compute_hash(log)
    if log in already_seen.get(log_hash, ()):
        continue
    already_seen[log_hash].append(log)
But this is really duplicating the checking logic that a dict is supposed to implement for us.
Another approach is to use our own type for log entries:
from dataclasses import dataclass
from typing import Optional

# It's not very well explained in the documentation, but
# when the type is declared to be immutable using `frozen=True`,
# it will have a __hash__ generated as well.
@dataclass(frozen=True)
class LogEntry:
    id: str
    error: Optional[str]
    fruit: str

# convert existing logs
logs = [LogEntry(**d) for d in logs]

# Now, since the type is hashable, we can directly deduplicate with a set:
set(logs)
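If the original ordering of the logs matters, a small usage sketch: dict.fromkeys deduplicates the same hashable entries while preserving insertion order:
unique_logs = list(dict.fromkeys(logs))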
CodePudding user response:
You have some working answers, but I think you may be overcomplicating things. Here's the quick fix I would have done on your original code.
logs = [
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '1234', 'error': None, 'fruit': 'orange'},
    {'id': '12345', 'error': None, 'fruit': 'apple'},
]

def get_values(log: dict):
    return tuple(log.values())

unique_logs = set(map(get_values, logs))
for log in unique_logs:
    print(log)
('12345', None, 'apple')
('1234', None, 'orange')
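One caveat: tuple(log.values()) depends on the dicts' key insertion order. If that order can vary between logs, sorting the items first gives an order-independent canonical form (a sketch; assumes the keys are comparable strings):
def get_values(log: dict):
    # Sort by key so {'a': 1, 'b': 2} and {'b': 2, 'a': 1} compare equal.
    return tuple(sorted(log.items()))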
CodePudding user response:
You can't base "already seen" on a hash alone. Hashes are smaller than the actual data, so they can have collisions; that is the tradeoff. Use hashes to group logs, then check for equality within each group - otherwise there will still be collisions.
Sets and dicts already do this hashing-plus-equality check for you. The problem is that your data is dictionaries, which are unhashable because they are mutable. One simple way around that is to convert your dicts to strings using json.dumps(), or to find something similar to frozenset but for dicts.
import json

already_seen = set()
for log in logs:
    # Not a lossy hash: the full JSON string is stored, so no collisions.
    # sort_keys=True makes dicts with the same content but different key
    # order serialize identically.
    log_hash = json.dumps(log, sort_keys=True)
    if log_hash in already_seen:
        continue
    already_seen.add(log_hash)
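For the "something similar to frozenset but for dicts" idea, a frozenset of the items works directly as the set member, assuming all the values are hashable (they are here, since they are str or None):
already_seen = set()
for log in logs:
    key = frozenset(log.items())  # hashable, key-aware, order-independent
    if key in already_seen:
        continue
    already_seen.add(key)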