Requirements
Can we write a function f
which:
- takes a single argument as input.
- returns an integer
- has the property that
f(x) == f(y)
if and only ifx
is exactly the same asy
orx
is a copy ofy
.
We say that "x
is exactly the same as y
or x
is a copy of y
" if at least one of the following conditions is met:
id(x) == id(y)
returnsTrue
repr(x) == repr(y)
returnsTrue
x is y
returnsTrue
y is x
returnsTrue
x == copy.deepcopy(y)
returnsTrue
y == copy.deepcopy(x)
returnsTrue
I realize that our function f
will never be fool-proof, but please assume that nobody is going to define a new class with a really weird __eq__
method.
Also assume that no adversary is going to define a new class with a truly bizarre __copy__
method or some sort of pathological __deepcopy__()
.
One Attempt at Defining f
My feeble attempt at defining a function f
is shown below:
pi = lambda k_1, k_2: (1/2)*(k_1 k_2)(k_1 k_2 1) k_2
def f(x:object) -> int:
"""
Based on something called "Cantor's pairing function"
from the field of mathematics.
"""
k_1 = hash(x)
k_2 = hash(type(x))
if k_1 < 0 or k_2 < 0:
raise NotImplementedError()
return pi(k_1, k_2)
Flaws in my solution
There are some issues:
- I worry that two different objects will have the same hash value. For example, maybe
hash([1, 2]) == hash([2, 3])
. The two lists[1, 2]
and[2, 3]
are different, but might have different hash values. Feel free to correct me if I am wrong. - my solution only works when
hash()
returns a non-negative integer.
id(x) == id(y)
versus x == y
x == y
is implemented as x.__eq__(y)
Note that the id
of an object is generally not the same as the id
of a copy of that object. Usually, an object and a copy of the object usually reside at different memory addresses.
The following code demonstrates the difference between __eq__()
and id()
:
x = [1, 2, 3]
y = [1, 2, 3]
print("id(x)".ljust(15), id(x))
print("id(y)".ljust(15), id(y))
print("x is x".ljust(15), x is x)
print("x is y".ljust(15), x is y)
print("x == x".ljust(15), x == x)
print("x == y".ljust(15), x == y)
We have:
id(x) 2531180442688
id(y) 2531180443264
x is x True
x is y False
x == x True
x == y True
Worry a little bit, but not too much about custom-made __repr__
or __eq__
I realize that our function f
will never be fool-proof, but please assume that nobody is going to define a new class with a really weird __eq__
, __hash__
method, or __eq__
method.
For example, someone could (in theory) define the following:
import random
class NoNoNo:
def __init__(self, *args):
self.myVal = random.randint(1, 9)
def __eq__(self, other):
return False
def __repr__(self):
return "$"
Objects instantiated from the NoNoNo
class are never equal to each-other.
import copy
x = NoNoNo()
y = copy.deepcopy(x)
print("x == y?", "Yes" if x == y else "No") # prints "x == y? No"
Also, repr(x)
is always the same as repr(y)
import copy
x = NoNoNo()
y = NoNoNo() # `y` IS VERY DIFFERENT FROM `x`
print("repr(x) == repr(y)?", "Yes" if repr(x) == repr(y) else "No") # prints "repr(x) == repr(y)? Yes"
CodePudding user response:
If I have understood your question correctly, it means that the weakest condition is condition [2], and that's the condition we want to aim for when implementing the function. (I hope the condition wasn't added by mistake).
Does the following function make sense?
def f(x):
return functools.reduce(lambda x,y: x *256 y, repr(x).encode('latin1'))
CodePudding user response:
I will limit myself to the case repr(x) == repr(y)
. This covers all except the deepcopy bases, assuming repr
is deterministic. Then it essentially comes down to compressing a string down to a single integer in a unique fashion. This (well, it's inverse) is known as a universal code.
A way to do that is to take the string and convert it to bytes. Take note of the length l
of the string. Check how many bits it takes to represent l
. Then, output that many 1
bits followed by a 0
. Then l
itself encoded in binary, and finally the bytes in, say, little-endian encoding.
def encode(s):
bytes = s.encode("utf-8")
l = len(s)
b = l.bit_length()
ones = (1 << b) - 1
lshift = l << (b 1)
bytesshift = (int.from_bytes(bytes, "little") << (2*b 1))
return ones | lshift | bytesshift
def decode(n):
b = 0
while n % 2:
b = 1
n >>= 1
n >>= 1
l = n & ((1 << b) - 1)
n >>= b
return n.to_bytes(l, "little").decode("utf-8")
CodePudding user response:
I don't think this is feasible to implement. With arbitrary types any sort of paring function will lead to massive ints to represent everything except the most simple structures.
Ideally you would have to pair the data type and the data together to get a unique integer. The data will need to be converted into an integer unique to the set of all possible data inputs, which in practice would become very large very fast. Paired with the data types integer representation you will end up with a huge number the majority of the time.
If you want to handle negative hash values you could have a look at using a Signed Cantor Pairing Function. You would still have hash collisions however.
The best way I can see if simply concatenating the hash of the data structure with the data itself but then you may as well just keep the objects as they are. This could also lead to hash collisions between data types but unless you use tonnes of types it's unlikely. You could also use a perfect hash but you would have to define the set of allowed datatypes beforehand.