Home > Software design >  Different access time to a value of a dictionary when mixing int and str keys
Different access time to a value of a dictionary when mixing int and str keys

Time:09-24

Let's say I have two dictionaries and I know want to measure the time needed to check if a key is in the dictionary. I tried to run this piece of code:

from timeit import timeit

dct1 = {str(i): 1 for i in range(10**7)}
dct2 = {i: 1 for i in range(10**7)}

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))

Here are the results that I get:

2.529034548999334
2.212983401999736

Now, let's say I try to mix integers and strings in both dictionaries, and measure access time again:

dct1[7] = 1
dct2["7"] = 1

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
print(timeit('"7" in dct2', setup='from __main__ import dct2', number=10**8))

I get something weird:

3.443614432000686
2.6335261530002754
2.1873921409987815
2.272667104998618

The first value is much higher than what I had before (3.44 vs 2.52). However, the third value is basically the same as before (2.18 vs 2.21). Why is this happening? Can you reproduce the same thing or is this only me? Also, I can't understand the big difference between the first and the second value: it looks like it's more difficult to access a string key, but the same thing seems to apply only slightly to the second dictionary. Why?

Update

You don't even need to actually add a new key. All you need to do to see an increase in complexity is just checking if a key with different type exists!! This is much weirder than I thought. Look at the example here:

from timeit import timeit

dct1 = {str(i): 1 for i in range(10**7)}
dct2 = {i: 1 for i in range(10**7)}

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
# 2.55
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
# 2.26

7 in dct1
"7" in dct2

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
# 3.34
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
# 2.35

CodePudding user response:

Let me try to answer my own question. The dict implementation in CPython is optimised for lookups of str keys. Indeed, there are two different functions that are used to perform lookups:

  • lookdict is a generic dictionary lookup function that is used with all types of keys
  • lookdict_unicode is a specialised lookup function used for dictionaries composed of str-only keys

Python will use the string-optimised version until a search for non-string data, after which the more general function is used.

And it looks like you cannot even reverse the behaviour of a particular dict instance: once it starts using the generic function, you can't go back to using the specialised one!

  • Related