I am working with a very large dataset, and am looping over chunks of data to add elements to a class. There are many duplicated values in my data, meaning that I am creating a class instance for the same data many times. From some of the testing I've done, it seems that actually creating the instance of the class is the most expensive part of the operation so I want to minimise this as much as possible.
My question is: What is the least expensive (time) way of avoiding creating duplicate class instances? Ideally I would like to create a class instance once only and all duplicates reference the same instance. I can't remove duplicates from my data at the outset, but I want to make sure I minimise any costly procedures.
Here is a toy example that I hope illustrates my problem. The commented out section shows my thinking for how I might be able to shave off time.
In this example, Person contains 2 methods that call sleep to demonstrate the time cost of creating an instance. As written, the code runs in 4.22 seconds (roughly (SLEEP_1 * 6) + (SLEEP_2 * 6)). Seeing as the person "James" is present 3 times, I am looking for a way to create this person only once, and then reference that instance for the 2 duplicates.
I would then expect the code to run in ~2.8s ((SLEEP_1 * 4) + (SLEEP_2 * 4)).
import time
from collections import defaultdict

SLEEP_1 = 0.2
SLEEP_2 = 0.5


# A class `Person` has a load of methods,
# meaning that creating an instance has a non-negligible time-cost over millions of calls.
class Person:
    def __init__(self, info):
        self._id = info['_id']
        self.name = info['name']
        self.nationality = info['nationality']
        self.age = info['age']
        self.can_drink_in_USA = self.some_long_fun()
        self.can_fly_solo = self.another_costly_fun()

    def some_long_fun(self):
        time.sleep(SLEEP_1)
        if self.age >= 21:
            return True
        return False

    def another_costly_fun(self):
        time.sleep(SLEEP_2)
        if self.age >= 18:
            return True
        return False

# Some data to iterate over
# Note that "James" is present 3 times
teams = {
    "team1": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "bar", "name": "Frank", "nationality": "American", "age": 36},
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32}
    ],
    "team2": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "baz", "name": "Oliver", "nationality": "British", "age": 26},
        {"_id": "qux", "name": "Josh", "nationality": "British", "age": 42}
    ]
}

seen = defaultdict(int)
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # p = getattr(Person, '_id') == person['_id']
            # team_directory[team].append(p)
            # continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = 1

finish_time = time.time() - start_time
expected_finish = round((SLEEP_1 * 6) + (SLEEP_2 * 6), 2)
print(f"Built a teams directory in {round(finish_time, 2)}s [expect: {expected_finish}s]")

# Loop over the results to check - I want each team to have 3 people
# (so I can't squash duplicates from the outset)
for t in team_directory:
    roster = " ".join([p.name for p in team_directory[t]])
    print(f"Team {t} contains these people: {roster}")
CodePudding user response:
seen can be used as a cache that associates each person _id with an already created Person object.
This can look like the following (code up to and including the main for-loop; the remaining code doesn't need a change):
seen = {}
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # Reuse the instance created earlier instead of building a new one
            p = seen[person['_id']]
            team_directory[team].append(p)
            continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        # Store the instance itself (not a flag) so duplicates can reference it
        seen[person['_id']] = p

An assignment such as seen[person['_id']] = p copies only a reference to the object, not the object itself, so it needs very little extra memory.
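To illustrate the point about references, here is a minimal sketch. The Person class below is a simplified stand-in for the one in the question (an assumption for illustration, without the sleeps), and the names record, team1 and team2 are hypothetical. It shows that the cached entry and both team lists point at one and the same object, which also means a change made through one reference is visible through the others.

```python
class Person:
    """Simplified stand-in for the question's Person class (illustration only)."""

    def __init__(self, info):
        self._id = info["_id"]
        self.name = info["name"]


seen = {}
record = {"_id": "foo", "name": "James"}

# Create the instance once and cache it under its _id
p = Person(record)
seen[record["_id"]] = p

# Both "teams" receive a reference to the same cached object
team1 = [p]
team2 = [seen["foo"]]

# `is` compares object identity, not equality
same_object = team1[0] is team2[0]
print(same_object)

# Because the object is shared, a mutation through one list shows up in the other
team1[0].name = "Jim"
print(team2[0].name)
```

If the duplicates must be modifiable independently of each other, sharing one instance is not appropriate and a copy would be needed instead.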
CodePudding user response:
"creating an instance has a non-negligible time-cost over millions of calls"
Then don't make those calls. Your two example methods compute derived values: they depend only on other attributes, so they can remain instance methods, and their results need not be stored in instance fields. Plus, you never use those results outside the constructor, so the calls can be removed from there and deferred to whatever code actually needs them.
Also, you only need one function for that example code, and no sleeps:

def age_check(age):
    def f(over):
        return age >= over
    return f

age_check(self.age)(18)
age_check(self.age)(21)

Or, simpler, as a method of Person:

def age_check(self, over):
    return self.age >= over

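One possible way to defer the derived values, sketched under assumptions: this is not the asker's real class, only a reduced version of the toy Person, and it uses functools.cached_property (available in the standard library from Python 3.8) so that each value is computed on first access and then stored on the instance. The attribute names are taken from the question; the variable james is hypothetical.

```python
from functools import cached_property


class Person:
    """Reduced version of the toy Person class; derived values are computed lazily."""

    def __init__(self, info):
        # The constructor only copies plain fields, so it stays cheap
        self._id = info["_id"]
        self.name = info["name"]
        self.age = info["age"]

    def age_check(self, over):
        return self.age >= over

    @cached_property
    def can_drink_in_USA(self):
        # Computed on first access, then cached in the instance's __dict__
        return self.age_check(21)

    @cached_property
    def can_fly_solo(self):
        return self.age_check(18)


james = Person({"_id": "foo", "name": "James", "age": 32})

# Nothing has been computed yet at this point
computed_before = "can_drink_in_USA" in james.__dict__

print(james.can_drink_in_USA)

# After the first access the result is stored on the instance
computed_after = "can_drink_in_USA" in james.__dict__
```

Whether caching is worthwhile depends on how expensive the real computation is and how often the value is read; for a cheap comparison like this one, a plain method or property would do.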
"need to reference the instance where Person._id == person['_id'] and I'm not sure how to do this efficiently / at all. Ultimately, I need to add this: team_directory[team].append(p)"
Don't use a list and append for the lookup. Use a dict that maps Person._id to the Person instance itself. Then you don't need to waste cycles iterating over a list to see whether a person already exists.
Obviously, this all assumes your dataset will fit in memory.
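A minimal sketch of the dict-based lookup, assuming a simplified Person without the sleeps. The helper get_person, the people dict and the created counter are hypothetical names introduced only for this illustration. Per-team lists are kept here because the question needs each team to hold 3 people, duplicates included; the dict is used only to find an existing instance in constant time.

```python
from collections import defaultdict


class Person:
    """Simplified stand-in for the question's Person class."""

    created = 0  # counts constructor calls, for illustration only

    def __init__(self, info):
        Person.created += 1
        self._id = info["_id"]
        self.name = info["name"]


teams = {
    "team1": [
        {"_id": "foo", "name": "James"},
        {"_id": "bar", "name": "Frank"},
        {"_id": "foo", "name": "James"},
    ],
    "team2": [
        {"_id": "foo", "name": "James"},
        {"_id": "baz", "name": "Oliver"},
        {"_id": "qux", "name": "Josh"},
    ],
}

people = {}  # maps _id -> Person instance
team_directory = defaultdict(list)


def get_person(info):
    """Return the cached Person for this _id, creating it only on first sight."""
    person = people.get(info["_id"])
    if person is None:
        person = Person(info)
        people[info["_id"]] = person
    return person


for team, members in teams.items():
    for info in members:
        team_directory[team].append(get_person(info))

print(f"{Person.created} instances created for "
      f"{sum(len(m) for m in teams.values())} records")
```

Note that people.setdefault(info["_id"], Person(info)) would not help here, because the Person(info) argument is evaluated, and the instance built, before setdefault checks whether the key exists.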