So I have this script:
# g, f, z, n
a = [(4264, 7, 1526, 0),
     (4293, 14, 846, 93),
     (4174, 6, 962, 0),
     (4256, 12, 121, 0),
     (4257, 29, 182, 385)]
# list a: ca. 200,000 entries
# g, f, z, n, id
b = [(4264, 10, 397, 0, 113),
     (4264, 20, 95, 0, 114),
     (4279, 13, 41, 0, 115),
     (4293, 14, 846, 93, 116),
     (4264, 8, 94, 0, 117),
     (4264, 8, 92, 0, 118),
     (4256, 12, 121, 0, 119),
     (4264, 9, 293, 82, 120),
     (4264, 9, 288, 0, 121),
     (4264, 8, 90, 25, 122),
     (4264, 9, 156, 0, 123)]
# list b: ca. 1,000,000 entries
xyz = []
for i in a:
    for x in b:
        if i[0] == x[0] and i[1] == x[1] and i[2] == x[2] and i[3] == x[3]:
            xyz.append(x[4])
print(xyz)
# The desired result (the ids from list b): [116, 119]
The individual entries (tuples) in both lists have a fixed structure and differ only in the trailing "id" field that list b carries.
The aim is to get the ids from list b for every entry whose (g, f, z, n) values also exist in list a.
My approach works, but it takes a very long time over all entries (list a has ca. 200,000 entries and list b ca. 1,000,000)!
Does anyone know a faster method?
CodePudding user response:
With thanks to @ThePyGuy, I demonstrate two approaches using randomly generated lists of sizes equivalent to those the OP requires.
from random import randint as R
from time import perf_counter
import pandas as pd
A = 200_000
B = 1_000_000
# building the lists is excluded from the timing, as it's not relevant to the comparison
a = [(R(1, A), R(1, A), R(1, A), R(1, A)) for _ in range(A)]
b = [(R(1, B), R(1, B), R(1, B), R(1, B), R(1, B)) for _ in range(B)]
_start = perf_counter()
d = {(g, f, z, n): i for g, f, z, n, i in b}  # build dictionary keyed by the 4-tuple
# iterate over list "a" looking for matches in the dictionary
for t in a:
    if (_id := d.get(t)):
        print(_id)
_end = perf_counter()
print(f'{_end-_start:.4f}')
# timing of the pandas solution
_start = perf_counter()
pd.DataFrame(a).merge(pd.DataFrame(b)).iloc[:, -1].drop_duplicates().tolist()
_end = perf_counter()
print(f'{_end-_start:.4f}')
Output:
0.3319
1.1606
...which, perhaps surprisingly, shows that pandas does not offer optimum performance for this use case.
In case it's relevant: Python 3.10.7, macOS 12.6, 3 GHz 10-Core Intel Xeon W, 32GB RAM, pandas 1.4.4
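One caveat with the dictionary lookup above: "if (_id := d.get(t))" treats a falsy id such as 0 as a miss. A variant (my sketch, not part of the timed code) that collects the ids into a list and sidesteps that pitfall:
# membership test instead of a truthiness check, so an id of 0 survives
matches = [d[t] for t in a if t in d]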
CodePudding user response:
If you don't mind using a pandas DataFrame, here is one solution using a pandas inner merge:
>>> import pandas as pd
>>> pd.DataFrame(a).merge(pd.DataFrame(b)).iloc[:,-1].drop_duplicates().tolist()
[116, 119]
And since it's pandas, it should be fast enough.
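If you want the merge keys to be explicit, you can name the columns first (the labels below are my assumption, taken from the comments in the question):
import pandas as pd
df_a = pd.DataFrame(a, columns=["g", "f", "z", "n"])
df_b = pd.DataFrame(b, columns=["g", "f", "z", "n", "id"])
# inner merge on the four shared columns, then keep the unique ids
ids = df_a.merge(df_b, on=["g", "f", "z", "n"])["id"].drop_duplicates().tolist()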
CodePudding user response:
You can try using heapq. Something along the lines of:
from heapq import heappop, heappush
c = []
while a:
    t = a.pop()
    heappush(c, (t[0], t[1], t[-1]))
while b:
    t = b.pop()
    heappush(c, (t[0], t[1], t[-1]))
matches = []
while c:
    try:
        t1 = heappop(c)
        t2 = heappop(c)
        # note: this sketch keys on the first two fields of each entry only
        if t1[0:2] == t2[0:2]:
            matches.append(t1[-1])
            matches.append(t2[-1])
        else:
            heappush(c, t2)  # t2 may still pair up with the next entry
    except IndexError:
        print("end of lists")
print(matches)
There are probably less “verbose” ways to use the module, read the doc to find out more about usage.
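For what it's worth, heapq.merge can make the idea more compact. The sketch below is my own variant, not the code above: it matches on the full 4-tuple, assumes a and b are still intact (the snippet above consumes them with pop()), and tags each entry with its source so that, for equal keys, the entry from list a sorts first and one linear scan can pair them up:
from heapq import merge

# tag 0 = from list a, tag 1 = from list b; for equal (g, f, z, n)
# keys the a-entry sorts first because 0 < 1
keyed_a = sorted((g, f, z, n, 0) for g, f, z, n in a)
keyed_b = sorted((g, f, z, n, 1, i) for g, f, z, n, i in b)
matches, seen = [], None
for t in merge(keyed_a, keyed_b):
    if t[4] == 0:
        seen = t[:4]              # this 4-tuple exists in list a
    elif t[:4] == seen:
        matches.append(t[5])      # id from list b whose key was just seen in a
print(matches)  # [119, 116] for the sample data (ordered by key)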