I have a function find() that loops through a lot of objects to identify a matching object by comparing a bunch of properties.
class Target:
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

class Source:
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

def find(target: Target, source_set: set):
    for s in source_set:
        if s.a == target.a:
            if s.b == target.b:
                if s.c == target.c:
                    print("Found!")

source_set = {
    Source(a=1, b=2, c=3),
    Source(a=4, b=2, c=4)
}
target = Target(a=4, b=2, c=4)
find(target, source_set)
The current function is very slow, as my source_set can contain millions of objects. The creation of source_set and its Source objects can be adjusted (e.g. their type), but the source_set itself is not modified after initialisation. Each Source object is created from a dict with the same properties; one Source's raw input data looks like this:

{'a': '1', 'b': '2', 'c': '3'}

The source_set is searched with many targets. Is there a nice way to be more efficient? I'm hoping not to have to change the data structure.
CodePudding user response:
Without any external libraries, you can override the __hash__ method of each class:

class Target:
    ...
    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))

class Source:
    ...
    def __hash__(self):
        return hash(frozenset(self.__dict__.items()))
Now try:

count = len({hash(target)}.intersection(map(hash, source_set)))
print(count)
# Output
1
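One caveat with comparing hashes alone: two objects with different property values can, in rare cases, share a hash, so the intersection above could report a false positive. A safer variant (a sketch of my own, not part of the answer above) also defines __eq__, so that Python's built-in set membership test resolves collisions by comparing the actual values, giving an average O(1) lookup per target:

```python
class Target:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __hash__(self):
        # Hash on the property values so objects with equal
        # properties land in the same hash bucket
        return hash((self.a, self.b, self.c))

    def __eq__(self, other):
        # Compare property values, not object identity, so a Target
        # can match a Source with the same a, b, c
        return (self.a, self.b, self.c) == (other.a, other.b, other.c)

class Source(Target):
    # Same properties and semantics; inheriting avoids duplicating
    # __hash__ and __eq__
    pass

source_set = {Source(1, 2, 3), Source(4, 2, 4)}

print(Target(4, 2, 4) in source_set)  # True
print(Target(9, 9, 9) in source_set)  # False
```

With this in place, each of the many targets costs one set lookup instead of a scan over millions of Source objects.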
CodePudding user response:
Using Pandas:
# Python env: pip install pandas
# Miniconda env: conda install pandas
import pandas as pd
df = pd.DataFrame([s.__dict__ for s in source_set])
sr = pd.Series(target.__dict__)
print(df)
print(sr)
# Output of source_set
a b c
0 4 2 4
1 1 2 3
# Output of target
a 4
b 2
c 4
dtype: int64
Find same rows:
>>> sr.eq(df).all(axis=1).sum()
1
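If changing the classes or pulling in Pandas is not an option, a third sketch (the names index and find here are my own, not from the answers): precompute a set of attribute tuples once, since source_set is never modified after initialisation, and each of the many targets then costs a single average-O(1) membership test:

```python
class Source:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

class Target(Source):
    pass

source_set = {Source(1, 2, 3), Source(4, 2, 4)}

# Build the lookup index once; source_set is not modified afterwards,
# so the index stays valid for every subsequent target.
index = {(s.a, s.b, s.c) for s in source_set}

def find(target, index):
    # One set membership test instead of a full scan
    return (target.a, target.b, target.c) in index

print(find(Target(4, 2, 4), index))  # True
print(find(Target(9, 9, 9), index))  # False
```

This keeps the original classes untouched; only an auxiliary set of plain tuples is added next to the existing data structure.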