Running a loop that initializes a class is extremely slow. I need to initialize the class for 30,547,200,000 rows, which will take about 30 hours with the code in its current state, and I need to repeat this process several times to find bugs etc.
Why is the second block, i.e. the one that initializes a class, so much slower, and what can I do to make it faster? Note: the API of the function that I use needs a list of objects.
import dataclasses
import time
import pandas as pd
import numpy as np
# Without class initialization in the loop.
df = pd.DataFrame(data={'a': np.arange(4320), 'b': np.arange(4320)})
tmp = list(zip(df.a, df.b))
start_time = time.time()
for _ in range(1000):
    a = [(a, b) for a, b in tmp]
print('Without class:', time.time()-start_time)
Without class: 0.7349910736083984 seconds
# With class initialization in the loop.
@dataclasses.dataclass
class SomeClass(object):
    a: float
    b: float
df = pd.DataFrame(data={'a': np.arange(4320), 'b': np.arange(4320)})
tmp = list(zip(df.a, df.b))
start_time = time.time()
for _ in range(1000):
    a = [SomeClass(a=a, b=b) for a, b in tmp]
print('With class:', time.time()-start_time)
With class: 14.693351745605469 seconds
CodePudding user response:
dataclasses (and classes in general) can be a bit slow in Python. You can try specifying __slots__ to save some memory and perhaps some time as well. As far as I know, dataclasses accept slots=True from Python 3.10 onwards, so check whether your Python version supports it; on older versions, consider switching to a regular class that defines __slots__ itself. Anyway, I have the feeling that if you currently need 30 hours, you'll get down to a few hours at best. That is probably still too much.
See here: https://docs.python.org/3/reference/datamodel.html#slots
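To make the __slots__ suggestion concrete, here is a minimal sketch, not a measured fix. The names SlottedPoint and PlainSlotted are made up for illustration, and the version check assumes that slots=True is only accepted by dataclasses on Python 3.10 and later.

```python
import dataclasses
import sys

# Assumption: dataclasses only accept slots=True on Python 3.10+,
# so the argument is passed conditionally.
_dataclass_kwargs = {"slots": True} if sys.version_info >= (3, 10) else {}


@dataclasses.dataclass(**_dataclass_kwargs)
class SlottedPoint:  # hypothetical name
    a: float
    b: float


class PlainSlotted:  # hypothetical name; a regular class works on any version
    __slots__ = ("a", "b")

    def __init__(self, a, b):
        self.a = a
        self.b = b


p = PlainSlotted(1.0, 2.0)
q = SlottedPoint(a=1.0, b=2.0)
```

Instances of a slotted class have no per-instance __dict__, which is where the memory saving comes from. Whether it also saves a meaningful amount of time for your workload is something you would have to measure.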
But are you really going to create 30 billion objects?!
I'd argue (without knowing your use case) that a better solution would be to avoid creating them at all.
For example, if you intend to create an object for every row in your dataset and then process them somehow to get an aggregated result, it will be more efficient to calculate the aggregated value first and then create an object just for that one. But again, I don't know your use case, so it's hard to give good advice here.
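A rough sketch of the "aggregate first, then build one object" idea, assuming the end result is something like a per-column mean. The Summary class and the summarize function are hypothetical, and plain lists stand in for the DataFrame columns.

```python
import dataclasses


@dataclasses.dataclass
class Summary:  # hypothetical result type
    mean_a: float
    mean_b: float


def summarize(a_values, b_values):
    """Aggregate the raw columns first, then create a single object."""
    n = len(a_values)
    return Summary(mean_a=sum(a_values) / n, mean_b=sum(b_values) / n)


# Stand-ins for df.a and df.b from the question.
a_values = list(range(4320))
b_values = list(range(4320))

summary = summarize(a_values, b_values)  # one object instead of 4320
```

With the pandas columns from the question, the sums could be replaced by df["a"].mean() and df["b"].mean(). Whether this applies depends on what the downstream function actually does with the list of objects.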
CodePudding user response:
Python isn't great for writing highly efficient code. The best way to write efficient code in Python is to not write it in Python, that is to say, use a library. You already have pandas and numpy, so perhaps we can help you avoid this altogether and do the same thing with one of those.
Here's some advice for writing efficient Python code:
- Loops are slow. That includes generators, list comprehensions, and lambda functions applied over iterables. itertools and numpy are your friends. If you can avoid loops altogether, the code will almost certainly be much faster.
- Classes are slow. A functional or data-oriented style is often faster in Python, so there's usually another idiomatic way to do the same thing unless you're literally dealing with objects. Data-oriented design also works here, like when you made tuples instead of objects. Built-in primitive types (numbers and strings) are usually the most efficient, though. Splitting the data out into separate columns tends to lead to faster code.
- If all else fails, try compiling your code with Cython or running it under PyPy.
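One way to read the "split out the data" advice is to keep one typed array per field and only build a row object when something actually asks for it. The sketch below uses the standard-library array module; the Columns and Row names are made up for illustration, and it does not help if the downstream API truly needs every row as an object up front.

```python
from array import array
from typing import NamedTuple


class Row(NamedTuple):  # hypothetical per-row type, built only on demand
    a: float
    b: float


class Columns:
    """Stores each field as one typed array instead of one object per row."""

    def __init__(self, a_values, b_values):
        self.a = array("d", a_values)
        self.b = array("d", b_values)
        if len(self.a) != len(self.b):
            raise ValueError("columns must have the same length")

    def __len__(self):
        return len(self.a)

    def row(self, i):
        # Create a row object lazily, only for the index requested.
        return Row(self.a[i], self.b[i])


cols = Columns(range(4320), range(4320))
total_a = sum(cols.a)  # column-wise work needs no row objects
first_rows = [cols.row(i) for i in range(3)]
```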
CodePudding user response:
Sometimes things are just slow? That said, NamedTuple uses a similar syntax and cuts down on initialisation time (it might make other things slower, though: the performance of different structures differs for reads, writes, etc.).
from typing import NamedTuple
class SomeClass(NamedTuple):
    a: float
    b: float
df = pd.DataFrame(data={'a': np.arange(4320), 'b': np.arange(4320)})
tmp = list(zip(df.a, df.b))
start_time = time.time()
for _ in range(1000):
    a = [SomeClass(a=a, b=b) for a, b in tmp]
print('With named tuple:', time.time()-start_time)
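If you want to compare the options on your own machine, a small timeit harness along these lines might help. It is only a sketch: the class and function names are invented, the repeat count is arbitrary, and the numbers will vary with Python version and hardware.

```python
import dataclasses
import timeit
from typing import NamedTuple


@dataclasses.dataclass
class DataClassRow:  # hypothetical name
    a: float
    b: float


class NamedTupleRow(NamedTuple):  # hypothetical name
    a: float
    b: float


# Same shape as `tmp` in the question, built without pandas.
pairs = [(float(i), float(i)) for i in range(4320)]


def build_tuples():
    return [(a, b) for a, b in pairs]


def build_dataclasses():
    return [DataClassRow(a=a, b=b) for a, b in pairs]


def build_namedtuples():
    return [NamedTupleRow(a=a, b=b) for a, b in pairs]


def benchmark(repeat=20):
    """Return total seconds spent per construction strategy."""
    candidates = {
        "tuple": build_tuples,
        "dataclass": build_dataclasses,
        "namedtuple": build_namedtuples,
    }
    return {name: timeit.timeit(fn, number=repeat)
            for name, fn in candidates.items()}


timings = benchmark()
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

timeit runs each function `repeat` times and reports the total, which is usually less noisy than a single time.time() difference.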