How do I organize and filter a collection of objects in Python by a field value? I need to filter both by equality to an exact value and by being less than a value.
And how do I do it efficiently? If I store my objects in a list, I have to iterate over the whole list, which may hold hundreds of thousands of objects.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    salary: float
    is_boss: bool

# if the objects are stored in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]
# filtering in O(n), sloooooow
target = 100000
filtered_collection = [x for x in collection if x.salary < target]
PS: Actually my use case is to group by a certain field, i.e. is_boss, and filter by another, i.e. salary. How to do that? Should I use itertools.groupby over sorted lists and make my objects comparable?
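Roughly what I have in mind (just a sketch based on the Person class above; not sure it's the right approach):

from itertools import groupby
from operator import attrgetter

# sort by the grouping field first, since groupby only merges consecutive runs
collection.sort(key=attrgetter('is_boss'))
grouped_and_filtered = {
    is_boss: [p for p in group if p.salary < target]
    for is_boss, group in groupby(collection, key=attrgetter('is_boss'))
}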
CodePudding user response:
If you maintain your list in sorted order (which ideally means few insertions or removals, because mid-list insertion/removal is itself O(n)), you can find the set of Persons below a given salary with the bisect module.
from bisect import bisect
from operator import attrgetter
# if the objects are stored in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]
collection.sort(key=attrgetter('salary')) # O(n log n) initial sort
# filtering searches in O(log n):
target = 100000
filtered_collection = collection[:bisect(collection, target, key=attrgetter('salary'))]
Note: The key argument to the various bisect module functions is only supported as of 3.10. In prior versions, you'd need to define the rich comparison operators for Person in terms of salary and search for a faked-out Person object, or maintain ugly separate sorted lists, one of salary alone, and a parallel list of the Person objects.
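The parallel-list workaround might look roughly like this on older Pythons (a sketch, assuming a salaries list kept in the same sorted order as collection):

from bisect import bisect
from operator import attrgetter

collection.sort(key=attrgetter('salary'))
salaries = [p.salary for p in collection]   # parallel list of keys, same order
target = 100000
filtered_collection = collection[:bisect(salaries, target)]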
For adding individual elements to the collection, you could use bisect's insort function. Or you could just add a bunch of items to the end of the list in bulk and re-sort it on the same key as before (Python's sorting algorithm, TimSort, gets near O(n) performance when the collection is mostly in order already, so the cost is not as high as you might think).
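A sketch of both options (insort's key parameter is also 3.10+; the Person("Joe", ...) value and the new_people list are just placeholders):

from bisect import insort
from operator import attrgetter

# single insert: the binary search is O(log n), but shifting elements makes it O(n) overall
insort(collection, Person("Joe", 75000, 0), key=attrgetter('salary'))

# bulk insert: append everything, then re-sort on the same key
collection.extend(new_people)
collection.sort(key=attrgetter('salary'))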
I'll note that in practice, this sort of scenario (massive data that can be arbitrarily ordered by multiple fields) usually calls for a database; you might consider using sqlite3 (eventually switching to a more production-grade database like MySQL or PostgreSQL if needed), which, with appropriate indexes defined, would let you do O(log n) SELECTs on any indexed field; you could convert to Person objects on extraction for the data you actually need to work with. The B-trees that true DBMS solutions provide get you O(log n) effort for inserts, deletes and selects on the indexed fields, whereas Python's built-in collection types make you choose; only one of insertion/deletion or searching can be truly O(log n), while the other is O(n).
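A minimal sqlite3 sketch of that idea (the table and index names here are just illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")  # or a file path for persistent storage
conn.execute("CREATE TABLE person (name TEXT, salary REAL, is_boss INTEGER)")
conn.execute("CREATE INDEX idx_person_salary ON person (salary)")  # index the filter column
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(p.name, p.salary, p.is_boss) for p in collection],
)
rows = conn.execute(
    "SELECT name, salary, is_boss FROM person WHERE is_boss = ? AND salary < ?",
    (0, 100000),
).fetchall()
non_boss_filtered = [Person(*row) for row in rows]  # back to Person objects as needed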
CodePudding user response:
Lists have a sort method - all you have to do is create a key function that returns the value to sort an object by - let me show you
class Foo:
    def __init__(self, bar):
        self.bar = bar

fooArray = [Foo(10), Foo(8), Foo(9)]

def sortFoo(foo):
    # key function: return the field to sort by
    return foo.bar

fooArray.sort(key=sortFoo)
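For example, after the sort the list is ordered by bar:

print([foo.bar for foo in fooArray])  # -> [8, 9, 10]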