As I understand the Python source code, inserting into a dict works something like this (a rough sketch of the scheme in code follows the list):

insert(e)
- Let j be the number of elements already in the allocated array and k be the size of the array (k is 8 at initialization).
- If j + 1 < (2/3)k, insert e at position hash(e) % k (let's not worry about collisions for now).
- Else, resize the array into an array of twice the size. k is now twice as large, and this takes O(k_prev).
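For concreteness, here is a minimal Python sketch of that model. It is not CPython's implementation; the class name ToyDict is made up, and collisions, deletions, and the compact-dict layout are ignored, just as in the description above:

```python
# A toy model of the insertion scheme described above.  This is NOT
# CPython's dict: collisions, deletions, and the compact-dict layout
# are all ignored, exactly as in the description.
class ToyDict:
    def __init__(self):
        self.k = 8                      # allocated size, 8 at initialization
        self.j = 0                      # number of elements stored
        self.slots = [None] * self.k

    def insert(self, e):
        # Resize when inserting one more element would reach 2/3 of capacity,
        # i.e. when it is NOT the case that j + 1 < (2/3) * k.
        if 3 * (self.j + 1) >= 2 * self.k:
            self._resize(2 * self.k)        # O(k_prev): everything is re-inserted
        self.slots[hash(e) % self.k] = e    # pretend there are no collisions
        self.j += 1

    def _resize(self, new_k):
        old = [x for x in self.slots if x is not None]
        self.k = new_k
        self.slots = [None] * new_k
        for x in old:                   # the O(k_prev) copying step
            self.slots[hash(x) % new_k] = x
```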
For n inserts, that means roughly log(n) - 3 resizes, each costing O(k_prev) rather than constant time, but all the resources online seem to say I should think of this operation as O(1)?
CodePudding user response:
When they say O(1), they mean "amortized cost". Some operations may be more expensive, but the average cost of an operation is still O(1).
Let's pretend we copy when the array is 3/4 full. The idea is still the same, but we can work with whole numbers. We copy 6 elements from an array of size 8 to one of size 16. We copy 12 elements to the array of size 32. We copy 24 elements to the array of size 64.

[I'm using ^ to mean exponentiation, not exclusive or.]
So we end up copying a total of 6(1 + 2 + 4 + ... + 2^n) elements into an array of size 2^(n+3). But that first number is less than 6 * 2^(n+1), which is 1.5 * 2^(n+3). So whatever size array we're currently at, the total amount of copying done to get to that array is less than 1.5 times the size of the array.
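A quick throwaway snippet (not anything from CPython, just the arithmetic above) makes it easy to check that bound numerically:

```python
# Check the copying bound numerically: resize at 3/4 full, double from size 8.
total_copied = 0
size = 8
for _ in range(20):                      # 20 doublings, plenty to see the trend
    total_copied += (3 * size) // 4      # copy 3/4 of the old array on resize
    size *= 2
    print(f"size={size:>8}  total_copied={total_copied:>8}  "
          f"ratio={total_copied / size:.3f}")
```

The ratio column climbs toward 0.75 and stays there, comfortably under the 1.5x bound derived above.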
CodePudding user response:
Rehash is actually O(n) cost, but given that it's rare and relatively cheap (the hashes are precomputed and cached, and you know none of the existing elements are equal, so a full bucket can be assumed non-equal), it's usually not a big deal.
So no, any given operation might actually be O(n). But much like list expansion, it's amortized O(1) insertion (technically, average-case O(1) insertion; a given insertion, even without rehashing, could be O(n) if you managed to ensure all the items being inserted hash to the same value modulo the underlying storage size, but Python minimizes this for built-in types).
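To make that degenerate case concrete, here's a small, artificial demonstration; BadKey is a made-up class whose __hash__ deliberately returns a constant so that every key lands in the same probe chain:

```python
import time

# Artificial worst case: every key reports the same hash, so each insertion
# has to probe past (and compare against) all the existing entries.
class BadKey:
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42                        # force every key into the same probe chain
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.x == other.x

for n in (500, 1000, 2000):
    start = time.perf_counter()
    d = {BadKey(i): None for i in range(n)}
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```

Each doubling of n roughly quadruples the build time, i.e. O(n) per insertion; with ordinary keys the same loop grows linearly.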
The reason it remains O(1) amortized is that the average number of (re)insertions per element stays bounded by a constant multiplier, which is ignorable in big-O terms. You might actually be averaging O(2) or O(3) work per insertion, but because the number of rehashes goes down in direct proportion to the size (and therefore the work) of the rehashes, all it really means is that you do insertion work a little more often; it doesn't increase the average per-element cost as the dict grows.
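One way to see how rare the resizes actually are on a real CPython dict is to watch sys.getsizeof as you insert (the exact sizes and growth points vary between CPython versions, so treat the output as illustrative only):

```python
import sys

# Watch a real CPython dict grow: sys.getsizeof(d) only changes at the
# (increasingly rare) resize points.  Exact sizes and growth points vary
# between CPython versions, so treat the numbers as illustrative.
d = {}
last = sys.getsizeof(d)
resizes = 0
for i in range(1_000_000):
    d[i] = None
    now = sys.getsizeof(d)
    if now != last:
        resizes += 1
        print(f"after {i + 1:>9,} insertions the table is {now:>10,} bytes")
        last = now
print(f"{resizes} resizes over {len(d):,} insertions")
```

A million insertions trigger only a couple of dozen resizes, which is why the occasional O(n) rehash vanishes into the O(1) average.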