I have two lists:
- old_data has a length of 180,000.
- updated_data has a length of 184,000.
- Both lists contain IDs which are string values.
- There is a lot of overlap in IDs stored within the lists but I need to know which ones are no longer stored in updated_data so that I can say those IDs are not longer active in an SQL server.
Therefore, I need to check for any items in old_data that are not in updated_data and save them to a separate list, let's call it inactive_data.
I have the following code currently which is highly time-consuming and inefficient:
# Initialize list for no-longer active IDs to be saved into.
inactive_data = []
# Iteratively check if all IDs in old data are still in updated data.
# If they are not, add them to the list.
for Id in old_data:
if Id not in updated_data:
inactive_data.append(Id)
To run this code, it took approx. 45 minutes. I would to know how I could drastically reduce the time. If you have any suggestions please let me know!
CodePudding user response:
Give it a go
old_data = set(map(tuple, inactive_data))
new_data= set(map(tuple, updated_data))
print([i for i in old_dataif i not in new_data])
CodePudding user response:
The usual and easy way is to turn one or both of the list
into a set
.
An alternate way is to sort both lists and compare them element by element. Once they're ordered, the comparison is quick because it's O(n).
CodePudding user response:
You can first convert updated_data
into a set, and then use in
; in
operator on a set is very fast (O(1)).
import random
m, n = 180000, 184000
old_data = [random.randrange(m) for _ in range(m)]
updated_data = [random.randrange(n) for _ in range(n)]
updated_data = set(updated_data)
inactive_data = [x for x in old_data if x not in updated_data]
The above code takes less than 1 second on my machine.