Home > Software engineering >  Python: What is a faster way to check if items in a large list are not in another large list?
Python: What is a faster way to check if items in a large list are not in another large list?

Time:07-21

I have two lists:

  • old_data has a length of 180,000.
  • updated_data has a length of 184,000.
  • Both lists contain IDs which are string values.
  • There is a lot of overlap in IDs stored within the lists but I need to know which ones are no longer stored in updated_data so that I can say those IDs are not longer active in an SQL server.

Therefore, I need to check for any items in old_data that are not in updated_data and save them to a separate list, let's call it inactive_data.

I have the following code currently which is highly time-consuming and inefficient:

# Initialize list for no-longer active IDs to be saved into.
inactive_data = []

# Iteratively check if all IDs in old data are still in updated data.
# If they are not, add them to the list.
for Id in old_data:
    
    if Id not in updated_data:
        
        inactive_data.append(Id)

To run this code, it took approx. 45 minutes. I would to know how I could drastically reduce the time. If you have any suggestions please let me know!

CodePudding user response:

Give it a go

old_data = set(map(tuple, inactive_data))
new_data= set(map(tuple, updated_data))

print([i for i in old_dataif i not in new_data])

CodePudding user response:

The usual and easy way is to turn one or both of the list into a set.

An alternate way is to sort both lists and compare them element by element. Once they're ordered, the comparison is quick because it's O(n).

CodePudding user response:

You can first convert updated_data into a set, and then use in; in operator on a set is very fast (O(1)).

import random

m, n = 180000, 184000
old_data = [random.randrange(m) for _ in range(m)]
updated_data = [random.randrange(n) for _ in range(n)]

updated_data = set(updated_data)
inactive_data = [x for x in old_data if x not in updated_data]

The above code takes less than 1 second on my machine.

  • Related