Is a Python dict very slow for searching a key?


In the code below, rel_column is a string and id_map is a Python dict with 2.2 million key-value pairs.

if rel_column in id_map:
    node_ids = id_map[rel_column]

While debugging a speed bottleneck, my colleague said that this piece of code is the cause of the slowness. He said that the 'if' statement takes O(log n) time, but my impression has always been that looking up a key in a dict or set takes constant time, O(1). Given that the dict only has 2.2 million entries, this should not be the bottleneck.

I am confused. Is my colleague right?
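
One way to settle the question for the actual program is to profile it rather than reason about asymptotics. Below is a minimal sketch using the standard library's cProfile; id_map, rel_columns, and lookup_all are hypothetical stand-ins for the real code:

import cProfile

# Hypothetical stand-ins for the real data; only the shapes matter here.
id_map = {str(i): [i] for i in range(2_200_000)}
rel_columns = [str(i % 3_000_000) for i in range(1_000_000)]

def lookup_all():
    # Mirrors the code from the question.
    for rel_column in rel_columns:
        if rel_column in id_map:
            node_ids = id_map[rel_column]

cProfile.run("lookup_all()", sort="cumulative")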

CodePudding user response:

When in doubt, benchmark. Below, I arbitrarily choose random strings of length 8 for keys and values, and use approximately the size specified in the question.

import random
import string
import timeit

def randstr(N):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

id_map = {randstr(8): randstr(8) for _ in range(int(2e6))}

# key not present
x = randstr(8)
print(timeit.timeit(stmt=lambda: id_map[x] if x in id_map else None, number=1000000))

# key present, first with in/[], then with .get()
x = random.choice(list(id_map.keys()))
print(timeit.timeit(stmt=lambda: id_map[x] if x in id_map else None, number=1000000))
print(timeit.timeit(stmt=lambda: id_map.get(x, None), number=1000000))

Results:

0.11468726862221956
0.13692271895706654
0.13748164474964142

Conclusions (Python 3.9, well-distributed keys, small keys and values, 2M entries):

  • key not found is slightly faster than key found
  • in followed by [] is about the same speed as .get()
  • Both are on the order of 100ns per lookup
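
A third idiom not measured above is try/except. Here is a sketch in the same style; expect it to be competitive when the key is almost always present and noticeably slower on misses, since raising KeyError is expensive:

import random
import string
import timeit

def randstr(N):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

id_map = {randstr(8): randstr(8) for _ in range(int(2e6))}

def eafp(d, k):
    # "Easier to ask forgiveness than permission": a single lookup,
    # paying for an exception only on a miss.
    try:
        return d[k]
    except KeyError:
        return None

hit = random.choice(list(id_map.keys()))
miss = randstr(9)  # length 9 guarantees it is not one of the length-8 keys

print(timeit.timeit(lambda: eafp(id_map, hit), number=1000000))
print(timeit.timeit(lambda: eafp(id_map, miss), number=1000000))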

CodePudding user response:

Dictionaries in Python find keys in O(1) on average. But complexity is not the only factor in execution time: for example, accessing a list item by its index is also O(1), yet it is considerably faster than accessing a dictionary by its key.
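
A minimal sketch of that comparison (the numbers are illustrative; the gap depends on the interpreter and the key type):

import timeit

items = list(range(2_000_000))
d = {i: i for i in range(2_000_000)}

# Both accesses are O(1), but with different constant factors:
# list indexing is essentially pointer arithmetic, while a dict
# lookup must hash the key and probe the hash table.
print(timeit.timeit(lambda: items[1_234_567], number=1_000_000))
print(timeit.timeit(lambda: d[1_234_567], number=1_000_000))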

I tried replacing your code with:

node_ids = id_map.get(rel_column, node_ids)

thinking that it could speed things up by making only one dictionary access, but it didn't improve performance at all (it was actually slower).

To show that dictionary key access isn't O(log N), I benchmarked accessing existing and missing keys in dictionaries of various sizes. If a binary search were used internally, performance would degrade with size, and missing keys would be the worst case, requiring more time than existing keys (which they don't):

id_map1 = {str(i): str(i + 1) for i in range(2_000_000)}  # log2 N ≈ 21
id_map2 = {str(i): str(i + 1) for i in range(100_000)}    # log2 N ≈ 17
id_map3 = {str(i): str(i + 1) for i in range(5_000)}      # log2 N ≈ 12

highHits = ["-1", "0", "1000", "2000", "2"] * 10000     # 50,000 lookups, mostly existing keys
lowHits  = ["-1", "0", "-1000", "-2000", "-2"] * 10000  # 50,000 lookups, mostly missing keys

from timeit import timeit

print("2M/high  ",timeit(lambda: [id_map1[k] if k in id_map1 else None for k in highHits],number=1))
print("100K/high",timeit(lambda: [id_map2[k] if k in id_map2 else None for k in highHits],number=1))
print("5K/high  ",timeit(lambda: [id_map3[k] if k in id_map3 else None for k in highHits],number=1))

print("2M/low   ",timeit(lambda: [id_map1[k] if k in id_map1 else None for k in lowHits],number=1))
print("100K/low ",timeit(lambda: [id_map2[k] if k in id_map2 else None for k in lowHits],number=1))
print("5K/low   ",timeit(lambda: [id_map3[k] if k in id_map3 else None for k in lowHits],number=1))

Results:

# existing keys
2M/high   0.003838191999999907
100K/high 0.004719815000000072
5K/high   0.0046313949999996495

# Missing keys
2M/low    0.003382090999999754
100K/low  0.003280908000000249
5K/low    0.0034408550000000204

So the performance doesn't degrade with the size of the dictionary (if anything it improves), and missing keys are identified faster than existing keys, which rules out a binary search. Python dictionaries do not have O(log N) complexity for key access.
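
What a dict does instead is hash-based probing. The toy sketch below is a simplified model of the idea; CPython's real implementation uses open addressing with a more elaborate probe sequence, compares cached hashes before keys, and resizes as it fills, none of which is modeled here:

class ToyDict:
    # A simplified open-addressing hash table; a model, not CPython's layout.

    def __init__(self, size=8):
        self.slots = [None] * size  # each slot is None or a (key, value) pair

    def _probe(self, key):
        # Jump directly to a slot derived from the hash, then scan forward
        # on collision. On average this touches O(1) slots, independent of
        # how many items are stored -- no O(log n) search is involved.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def __setitem__(self, key, value):
        # Resizing when the table fills up is omitted for brevity.
        self.slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

d = ToyDict(size=16)
d["alpha"] = 1
d["beta"] = 2
print(d["alpha"], d["beta"])  # 1 2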
