Efficient looping and comparing properties of two similar objects-CodePudding

I have a function find() that needs to loop through a lot of objects to identify a similar object by comparing a bunch of properties.

class Target:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

class Source:

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c


def find(target: Target, source_set: set):
    for s in source_set:
        if s.a == target.a:
            if s.b == target.b:
                if s.c == target.c:
                    print("Found!")


source_set = {
    Source(a=1, b=2, c=3),
    Source(a=4, b=2, c=4)
}

target = Target(a=4, b=2, c=4)

find(target, source_set)

The current function is very slow as my source_set can be millions.

The source_set creation and its Source objects can be adjusted (e.g. the type). The source_set itself is not modified after initialisation.

The Source objects creation's input is coming from a dict with the same properties. One Source's raw input data is like this:

{'a': '1', 'b': '2', 'c': '3'}

The source_set is searched with many targets.

Is there a nice way to be more efficient? I'm hoping to not need to change the data structure.

CodePudding user response：

Without any external libraries, you can modify the __hash__ method of each class

class Target:

    ...
       
    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))


class Source:

    ...

    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))

Now try:

count = len({hash(target),}.intersection(map(hash, source_set)))
print(count)

# Output
1

CodePudding user response：

Using Pandas:

# Python env: pip install pandas
# Miniconda env: conda install pandas
import pandas as pd

df = pd.DataFrame([s.__dict__ for s in source_set])
sr = pd.Series(target.__dict__)
print(df)
print(sr)

# Output of source_set
   a  b  c
0  4  2  4
1  1  2  3

# Output of target
a    4
b    2
c    4
dtype: int64

Find same rows:

>>> sr.eq(df).all(axis=1).sum()
1