Let's say in python3, we used a dictionary as:
my_dct = {}
...
for i in range(100):
    my_dct[i] = True
...
for s in "potentially long string":
    my_dct[s] = True
My question is: when we run the above Python program through the Python interpreter, what capacity of container does the interpreter create at the first line? And, depending on that initial size, does the capacity increase during the course of the first or second for loop?
This link shows a capacity of 8, while this one shows 128, 2^15, etc. (depending on the type) as the capacity of the container upon instantiation of a dictionary in Python.
Would really appreciate any explanation, thank you.
CodePudding user response:
How the dict storage adjusts as it grows is an implementation detail, which most users don't need to know. If you are a developer, or want to write your own code in C, the details may be useful. But for choosing between lists and dicts, they don't matter.
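(If you do want to peek, sys.getsizeof gives a rough view of what your own interpreter allocates; a minimal sketch, assuming CPython - the exact numbers vary by version and platform.)

import sys

# freshly created, empty containers; getsizeof counts the container object
# itself, not anything it refers to (sizes are CPython- and version-specific)
print(sys.getsizeof({}))   # e.g. 64 bytes on a 64-bit CPython 3.x
print(sys.getsizeof([]))   # e.g. 56 bytes on a 64-bit CPython 3.x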
The important difference between lists and dicts is how they are indexed. A list is indexed by integers (or slices), and 'contains' all values up to its len. A dict is indexed by keys, which can be any hashable object. Strings are most common.
In [1]: import sys
On the surface these two objects are equivalent:
In [2]: adict = {i:True for i in range(100)}
In [3]: alist = [i for i in range(100)]
Both have a len() of 100, and can be indexed the same way:
In [4]: adict[50]
Out[4]: True
In [5]: alist[50]
Out[5]: 50
Oops, I meant to use alist = [True for i in range(100)], a list of 100 True values, rather than list(range(100)). But that doesn't change the following discussion.
A key difference is that list indices have to be consecutive. That is, if we want to use alist[200], we will have to add another 101 values to the list. Whereas adict[200] = False simply adds the key 200 to the hash table, with a new len of 101.
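To make that concrete, a small sketch (rebuilding the two containers from above):

adict = {i: True for i in range(100)}
alist = [True for i in range(100)]

# a list index must already exist: index 200 is out of range for 100 elements
try:
    alist[200]
except IndexError as e:
    print("list:", e)        # list index out of range

# a dict key doesn't have to be consecutive; this simply inserts a new entry
adict[200] = False
print(len(adict))            # 101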
Underlying storage is quite different.
alist has a C array that contains references for 100 values, plus some 'head room' to allow for fast growth with alist.append(...).
adict has a hash table that somehow stores references to the keys and the values.
In both cases, as they grow, the storage ('container'?) will have to be reallocated when it fills up. Again the details are implementation dependent, and largely invisible to users - unless we carefully track memory or times.
I've seen a number of SO questions that try to compare the getsizeof of lists and arrays. The answers always point out that this comparison has limited value, since it measures very different things.
In [8]: sys.getsizeof(alist)
Out[8]: 904
In [9]: sys.getsizeof(adict)
Out[9]: 4696
For a list, getsizeof is basically the size of the C array that stores the value references, plus a fixed per-object header. For a len(alist) of 100, the 904 bytes works out to a 56-byte header plus 8 bytes for each of 106 reference slots - the 100 values plus head room for roughly 6 more, at which point it will have to be reallocated.
Adding 6 values does not change size:
In [11]: alist.extend([True]*6)
In [12]: len(alist)
Out[12]: 106
In [13]: sys.getsizeof(alist)
Out[13]: 904
Adding another 6 forces a reallocation, with a larger head space (room for roughly 20 more references):
In [14]: alist.extend([True]*6)
In [15]: len(alist)
Out[15]: 112
In [16]: sys.getsizeof(alist)
Out[16]: 1112
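If you'd rather not probe by hand, a small loop shows every length at which the list storage is reallocated; a minimal sketch, assuming CPython (the exact break points are implementation details and differ between versions):

import sys

blist = []
last = sys.getsizeof(blist)
print(0, last)
for n in range(1, 130):
    blist.append(True)
    size = sys.getsizeof(blist)
    if size != last:
        # getsizeof only changes when the underlying C array is reallocated
        print(n, size)
        last = size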
I assume the getsizeof(adict) measures the size of the hash table. That's quite a bit larger.
Adding 12 more keys to adict doesn't change the size. There's enough spare room in the table to handle them without a resize.
In [21]: adict.update({i:True for i in range(100,106)})
In [22]: len(adict)
Out[22]: 106
In [23]: sys.getsizeof(adict)
Out[23]: 4696
In [24]: adict.update({i:True for i in range(106,112)})
In [25]: sys.getsizeof(adict)
Out[25]: 4696
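The same loop applied to a dict shows when its hash table actually grows - which is essentially what the original question asks about the two for loops. A minimal sketch, again CPython-specific (the table is commonly described as starting with 8 slots and resizing when roughly two-thirds full, but treat those numbers as version-dependent):

import sys

bdict = {}
last = sys.getsizeof(bdict)
print(0, last)
for i in range(100):           # mirrors the first loop in the question
    bdict[i] = True
    size = sys.getsizeof(bdict)
    if size != last:
        # a jump in getsizeof means the hash table was reallocated
        print(len(bdict), size)
        last = size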
For the total memory footprint we also have to consider the memory use of the keys and values. In these examples that isn't much, since all references to True refer to the same object, and small ints (CPython caches -5 through 256) are also preallocated and unique. But keys can be strings, tuples, or more complex objects, and values can be arbitrary objects. They are stored by reference, not by value, so adding them to a list or dict only adds the cost of a reference.
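As a quick illustration of 'by reference': the container's getsizeof does not grow with the size of the objects it points to; a minimal sketch:

import sys

small = ["x"]
big = ["x" * 10**6]      # one reference to a roughly 1 MB string

# both lists hold exactly one reference, so the list objects are the same size;
# the big string's memory is charged to the string object itself
print(sys.getsizeof(small), sys.getsizeof(big))   # equal
print(sys.getsizeof(big[0]))                      # about a million bytes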