Avoid creating duplicate class instances

I am working with a very large dataset, and am looping over chunks of data to add elements to a class. There are many duplicated values in my data, meaning that I am creating a class instance for the same data many times. From some of the testing I've done, it seems that actually creating the instance of the class is the most expensive part of the operation so I want to minimise this as much as possible.

My question is: What is the least expensive (time) way of avoiding creating duplicate class instances? Ideally I would like to create a class instance once only and all duplicates reference the same instance. I can't remove duplicates from my data at the outset, but I want to make sure I minimise any costly procedures.

Here is a toy example that I hope illustrates my problem. The commented out section shows my thinking for how I might be able to shave off time.

In this example, Person has two methods that call sleep to simulate the time cost of creating an instance. As written, the code runs in 4.22 seconds ((SLEEP_1 * 6) + (SLEEP_2 * 6)). Since the person "James" appears 3 times, I am looking for a way to create this person only once and then reference that instance for the 2 duplicates.

I would then expect the code to run in ~2.8s ((SLEEP_1 * 4) + (SLEEP_2 * 4)).

import time
from collections import defaultdict

SLEEP_1 = 0.2
SLEEP_2 = 0.5

# A class `Person` has a load of methods, 
# meaning that creating an instance has a non-negligible time-cost over millions of calls. 
class Person:
    def __init__(self, info):
        self._id = info['_id']
        self.name = info['name']
        self.nationality = info['nationality']
        self.age = info['age']
        self.can_drink_in_USA = self.some_long_fun()
        self.can_fly_solo = self.another_costly_fun()

    def some_long_fun(self):
        time.sleep(SLEEP_1)
        if self.age >= 21:
            return True
        return False

    def another_costly_fun(self):
        time.sleep(SLEEP_2)
        if self.age >= 18:
            return True
        return False


# Some data to iterate over
# Note that "James" is present 3 times
teams = {
    "team1": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "bar", "name": "Frank", "nationality": "American", "age": 36},
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32}
    ],
    "team2": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "baz", "name": "Oliver", "nationality": "British", "age": 26},
        {"_id": "qux", "name": "Josh", "nationality": "British", "age": 42}
    ]
}


seen = defaultdict(int)
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # p = getattr(Person, '_id') == person['_id']
            # team_directory[team].append(p)
            # continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] += 1

finish_time = time.time() - start_time
expected_finish = round((SLEEP_1 * 6) + (SLEEP_2 * 6), 2)
print(f"Built a teams directory in {round(finish_time, 2)}s [expect: {expected_finish}s]")

# Loop over the results to check - I want each team to have 3 people
# (so I can't squash duplicates from the outset)

for t in team_directory:
    roster = " ".join([p.name for p in team_directory[t]])
    print(f"Team {t} contains these people: {roster}")

CodePudding user response：

seen can be used as a cache that maps each person's _id to the Person object already created for it.

It could look like this (only the code up to and including the main for loop is shown; the rest needs no changes):

seen = {}
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            p = seen[person['_id']]
            team_directory[team].append(p)
            continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = p

An assignment such as seen[person['_id']] = p stores only a reference to the object, not a copy of the object itself, so it needs very little extra memory.
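
As a rough check of the reference behaviour described above, here is a minimal sketch. The trimmed-down Person, its created counter, and the records and roster names are assumptions made for illustration; they are not part of the original code.

```python
class Person:
    """Stand-in for the expensive class; counts how often it is constructed."""
    created = 0

    def __init__(self, info):
        Person.created += 1
        self._id = info["_id"]
        self.name = info["name"]


records = [
    {"_id": "foo", "name": "James"},
    {"_id": "bar", "name": "Frank"},
    {"_id": "foo", "name": "James"},
]

seen = {}    # _id -> Person instance already built
roster = []  # one entry per record, duplicates included

for record in records:
    p = seen.get(record["_id"])
    if p is None:
        # First time this _id appears: pay the construction cost once.
        p = Person(record)
        seen[record["_id"]] = p
    roster.append(p)

print(Person.created)          # 2: the constructor ran once per distinct _id
print(roster[0] is roster[2])  # True: both James entries are the same object
```

The roster still has three entries, but the two "James" entries point at one shared instance, so any change made through one of them is visible through the other.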

CodePudding user response：

"creating an instance has a non-negligible time-cost over millions of calls"

Then don't call them. Both of your examples are derived values: they are computed from other attributes, so they can remain instance methods and their results need not be stored in instance fields. You also never use them outside the constructor, so they can be removed from it and deferred to whatever code actually needs them.

Also, that example code needs only one function, and no sleeps:

def age_check(age):
    def f(over):
        return age >= over
    return f

# Used from inside a Person method, where self.age is available:
age_check(self.age)(18)  # can_fly_solo
age_check(self.age)(21)  # can_drink_in_USA

Or, more simply, as a method:

def age_check(self, over):
    return self.age >= over
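
If the derived values really are expensive and are read more than once, one possible variation is to compute them lazily on first access. The sketch below uses functools.cached_property (Python 3.8+), which is a swapped-in technique rather than what this answer proposes: unlike a plain method, it does store the result on the instance after the first access. The attribute names are taken from the question.

```python
from functools import cached_property


class Person:
    def __init__(self, info):
        # The constructor only copies fields; nothing costly runs here.
        self._id = info["_id"]
        self.name = info["name"]
        self.age = info["age"]

    def age_check(self, over):
        return self.age >= over

    @cached_property
    def can_drink_in_USA(self):
        # Runs on first access only; the result is then kept on the instance.
        return self.age_check(21)

    @cached_property
    def can_fly_solo(self):
        return self.age_check(18)


p = Person({"_id": "foo", "name": "James", "age": 19})
print(p.can_fly_solo)      # True
print(p.can_drink_in_USA)  # False
```

With this arrangement, instances whose flags are never read never pay for computing them.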

"need to reference the instance where Person._id == person['_id'] and I'm not sure how to do this efficiently / at all. Ultimately, I need to add this: team_directory[team].append(p)"

Don't use a list and append. Use a dict that maps each Person._id to the Person instance itself. Then you don't need to waste cycles iterating over the list to check whether a person already exists.
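
One way to read this suggestion is sketched below, assuming a single dict holds every distinct person and each team records only _id values. The names people and team_members and the trimmed-down Person are hypothetical. The per-team container is still a list here, because the question needs three entries per team, duplicates included.

```python
from collections import defaultdict


class Person:
    """Minimal stand-in for the expensive class in the question."""

    def __init__(self, info):
        self._id = info["_id"]
        self.name = info["name"]


teams = {
    "team1": [
        {"_id": "foo", "name": "James"},
        {"_id": "bar", "name": "Frank"},
        {"_id": "foo", "name": "James"},
    ],
    "team2": [
        {"_id": "foo", "name": "James"},
        {"_id": "baz", "name": "Oliver"},
        {"_id": "qux", "name": "Josh"},
    ],
}

people = {}                       # _id -> Person, one entry per distinct person
team_members = defaultdict(list)  # team name -> _id values, duplicates kept

for team, records in teams.items():
    for record in records:
        _id = record["_id"]
        if _id not in people:     # constant-time dict lookup, no list scan
            people[_id] = Person(record)
        team_members[team].append(_id)

for team, ids in team_members.items():
    roster = " ".join(people[_id].name for _id in ids)
    print(f"Team {team} contains these people: {roster}")
```

Only four Person objects are built for the six records, and the membership test is a dict lookup rather than a scan.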

Obviously, this all assumes your dataset will fit in memory.