Home > front end >  Efficient reverse order comparison of huge growing list in Python
Efficient reverse order comparison of huge growing list in Python

Time:04-05

In Python, my goal is to maintain a unique list of points (complex scalars, rounded), while steadily creating new ones with a function, like in this pseudo code

list_of_points = []

while True
   # generate new point according to some rule
   z = generate() 

   # check whether this point is already there
   if z not in list_of_points:
      list_of_points.append(z)
   
   if some_condition:
      break

Now list_of_points can become potentially huge (like 10 million entries or even more) during the process and duplicates are quite frequent. In fact about 50% of the time, a newly created point is already somewhere in the list. However, what I know is that oftentimes the already existing point is near the end of the list. Sometimes it is in the "bulk" and only very occasionally it can be found near the beginning.

This brought me to the idea of doing the search in reverse order. But how would I do this most efficiently (in terms of raw speed), given my potentially large list which grows during the process. Is the list container even the best way here?

I managed to gain some performance by doing this

list_of_points = []

while True
   # generate new point according to some rule
   z = generate() 

   # check very end of list
   if z in list_of_points[-10:]:
      continue

   # check deeper into the list
   if z in list_of_points[-100:-10]:
      continue

   # check the rest
   if z not in list_of_points[:-100]:
      list_of_points.append(z)

   if some_condition:
      break

Apparently, this is not very elegant. Using instead a second, FIFO-type container (collection.deque), gives about the same speed up.

CodePudding user response:

Your best bet might to be to use a set instead of a list, python sets use hashing to insert items, so it is very fast. And, you can skip the step of checking if an item is already in the list by simply trying to add it, if it is already in the set it wont be added since duplicates are not allowed.

Stealing your pseudo code axample

set_of_points = {}

while True
   # get size of set
   a = len(set_of_points)

   # generate new point according to some rule
   z = generate() 

   # try to add z to the set
   set_of_points.add(z)

   b = len(set_of_points)

   # if a == b  it was not added, thus already existed in the set

   if some_condition:
      break

CodePudding user response:

Use a set. This is what sets are for. Ah - you already have answer saying that. So my other comment: this part of your code appears to be incorrect:

   # check the rest
   if z not in list_of_points[100:]:
      list_of_points.append(z)

In context, I believe you meant to write list_of_points[:-100] there instead. You already checked the last 100, but, as is, you're skipping checking the first 100 instead.

But even better, use plain list_of_points. As the list grows longer, the cost to possibly do 100 redundant comparisons becomes trivial compared to the cost of copying len(list_of_points) - 100 elements

  • Related