Maybe I missed something, but I couldn't find an answer to this on SO or the Python docs.
Got curious about what actually happens under the hood when we write the following code:
maybe_smaller_set = set(some_huge_list)
Since sets don't have duplicates, I imagine there's something like a list comprehension or some clever trick (maybe using map()
or filter()
?) that occurs to avoid iterating over everything in some_huge_list
.
So my questions are these:
- How does Python actually do this conversion from a list instance into a set instance?
- And how costly is this conversion in terms of algorithmic complexity (ie, big-O)?
CodePudding user response:
Python set
s are based on hash tables. As such, insertion operations, per-element, are average/expected case O(1)
, though in pathological cases they can approach O(n)
.
There is no trick that can be used to avoid iterating the input list
though; you can't iterate all the elements without iterating all the elements in a general purpose data structure, no matter the language. The expected case for constructing a set
from a list
will be O(n)
, and the theoretical worst case scenario (if literally every element's hash code collides with every other element, but all elements are unique) would be O(n²)
.
As for how Python does it, it's nothing special; internally, it's roughly equivalent to:
newself = set()
newself.update(iterable)
which is in turn roughly equivalent to:
newself = set()
for elem in iterable:
newself.add(elem)
it just pushes all the work to the C layer (on the CPython reference interpreter), bypassing the Python layer work such a loop would entail.
To correct an additional misapprehension: There are no clever tricks for map
or filter
either. In the best case scenario (when called with a built-in function implemented in C), both of them just avoid a little Python level overhead that the equivalent generator expression would entail. But they're fully general purpose; they use no special tricks for list
or set
inputs or outputs, they just steadily grind through the input iterator, mapping/filtering each element as they're requested (when the map
/filter
object is itself iterated). They're definitionally O(n)
.