Requirements

Can we write a function f which:

takes a single argument as input.
returns an integer
has the property that f(x) == f(y) if and only if x is exactly the same as y or x is a copy of y.

We say that "x is exactly the same as y or x is a copy of y" if at least one of the following conditions is met:

id(x) == id(y) returns True
repr(x) == repr(y) returns True
x is y returns True
y is x returns True
x == copy.deepcopy(y) returns True
y == copy.deepcopy(x) returns True

I realize that our function f will never be fool-proof, but please assume that nobody is going to define a new class with a really weird __eq__ method.

Also assume that no adversary is going to define a new class with a truly bizarre __copy__ method or some sort of pathological __deepcopy__().

One Attempt at Defining `f`

My feeble attempt at defining a function f is shown below:

pi = lambda k_1, k_2: (1/2)*(k_1   k_2)(k_1   k_2   1) k_2

def f(x:object) -> int:
    """
        Based on something called "Cantor's pairing function"
        from the field of mathematics. 
    """   
    k_1 = hash(x)
    k_2 = hash(type(x))
    if k_1 < 0 or k_2 < 0:
        raise NotImplementedError()
    return pi(k_1, k_2)

Flaws in my solution

There are some issues:

I worry that two different objects will have the same hash value. For example, maybe hash([1, 2]) == hash([2, 3]). The two lists [1, 2] and [2, 3] are different, but might have different hash values. Feel free to correct me if I am wrong.
my solution only works when hash()returns a non-negative integer.

`id(x) == id(y)` versus `x == y`

x == y is implemented as x.__eq__(y)

Note that the id of an object is generally not the same as the id of a copy of that object. Usually, an object and a copy of the object usually reside at different memory addresses.

The following code demonstrates the difference between __eq__() and id():

x = [1, 2, 3]
y = [1, 2, 3]
print("id(x)".ljust(15), id(x))
print("id(y)".ljust(15), id(y))
print("x is x".ljust(15), x is x)
print("x is y".ljust(15), x is y)
print("x == x".ljust(15), x == x)
print("x == y".ljust(15), x == y)

We have:

id(x)           2531180442688
id(y)           2531180443264
x is x          True
x is y          False
x == x          True
x == y          True

Worry a little bit, but not too much about custom-made `repr` or `eq`

I realize that our function f will never be fool-proof, but please assume that nobody is going to define a new class with a really weird __eq__, __hash__ method, or __eq__ method.

For example, someone could (in theory) define the following:

import random

class NoNoNo:
    def __init__(self, *args):
        self.myVal = random.randint(1, 9)
    def __eq__(self, other):
        return False
    def __repr__(self):
        return "$"

Objects instantiated from the NoNoNo class are never equal to each-other.

import copy
x = NoNoNo()
y = copy.deepcopy(x)
print("x == y?", "Yes" if x == y else "No") # prints "x == y? No"

Also, repr(x) is always the same as repr(y)

import copy
x = NoNoNo()
y = NoNoNo() # `y` IS VERY DIFFERENT FROM `x`
print("repr(x) == repr(y)?", "Yes" if repr(x) == repr(y) else "No") # prints "repr(x) == repr(y)? Yes"

CodePudding user response：

If I have understood your question correctly, it means that the weakest condition is condition [2], and that's the condition we want to aim for when implementing the function. (I hope the condition wasn't added by mistake).

Does the following function make sense?

def f(x):
    return functools.reduce(lambda x,y: x *256   y, repr(x).encode('latin1'))

CodePudding user response：

I will limit myself to the case repr(x) == repr(y). This covers all except the deepcopy bases, assuming repr is deterministic. Then it essentially comes down to compressing a string down to a single integer in a unique fashion. This (well, it's inverse) is known as a universal code.

A way to do that is to take the string and convert it to bytes. Take note of the length l of the string. Check how many bits it takes to represent l. Then, output that many 1 bits followed by a 0. Then l itself encoded in binary, and finally the bytes in, say, little-endian encoding.

def encode(s):
    bytes = s.encode("utf-8")
    l = len(s)
    b = l.bit_length()
    ones = (1 << b) - 1
    lshift = l << (b   1)
    bytesshift = (int.from_bytes(bytes, "little") << (2*b   1))
    return ones | lshift | bytesshift

def decode(n):
    b = 0
    while n % 2:
        b  = 1
        n >>= 1
    n >>= 1

    l = n & ((1 << b) - 1)
    n >>= b
    return n.to_bytes(l, "little").decode("utf-8")

CodePudding user response：

I don't think this is feasible to implement. With arbitrary types any sort of paring function will lead to massive ints to represent everything except the most simple structures.

Ideally you would have to pair the data type and the data together to get a unique integer. The data will need to be converted into an integer unique to the set of all possible data inputs, which in practice would become very large very fast. Paired with the data types integer representation you will end up with a huge number the majority of the time.

If you want to handle negative hash values you could have a look at using a Signed Cantor Pairing Function. You would still have hash collisions however.

The best way I can see if simply concatenating the hash of the data structure with the data itself but then you may as well just keep the objects as they are. This could also lead to hash collisions between data types but unless you use tonnes of types it's unlikely. You could also use a perfect hash but you would have to define the set of allowed datatypes beforehand.

Requirements

One Attempt at Defining f

Flaws in my solution

id(x) == id(y) versus x == y

Worry a little bit, but not too much about custom-made __repr__ or __eq__

One Attempt at Defining `f`

`id(x) == id(y)` versus `x == y`

Worry a little bit, but not too much about custom-made `repr` or `eq`