So I have this script:
# g, f, z, n
a = [(4264, 7, 1526, 0),
     (4293, 14, 846, 93),
     (4174, 6, 962, 0),
     (4256, 12, 121, 0),
     (4257, 29, 182, 385)]
# list a: ca. 200,000 entries
# g, f, z, n, id
b = [(4264, 10, 397, 0, 113),
     (4264, 20, 95, 0, 114),
     (4279, 13, 41, 0, 115),
     (4293, 14, 846, 93, 116),
     (4264, 8, 94, 0, 117),
     (4264, 8, 92, 0, 118),
     (4256, 12, 121, 0, 119),
     (4264, 9, 293, 82, 120),
     (4264, 9, 288, 0, 121),
     (4264, 8, 90, 25, 122),
     (4264, 9, 156, 0, 123)]
# list b: ca. 1,000,000 entries
xyz = []
for i in a:
    for x in b:
        if i[0] == x[0] and i[1] == x[1] and i[2] == x[2] and i[3] == x[3]:
            xyz.append(x[4])
print(xyz)
# The desired result (the ids from list b): [116, 119]
The individual entries (tuples) in both lists have a fixed structure and differ only in the trailing "id" field that list b carries.
The aim is to get the ids from list b for every entry whose (g, f, z, n) values also exist in list a.
My approach works, but it takes a very long time over all entries (list a has ca. 200,000 entries and list b ca. 1,000,000)!
Does anyone know a faster method?
CodePudding user response:
With thanks to @ThePyGuy, I demonstrate two approaches using randomly generated lists of sizes equivalent to those the OP requires.
from random import randint as R
from time import perf_counter
import pandas as pd
A = 200_000
B = 1_000_000
# building the lists is excluded from the timing, as it's not relevant to the comparison
a = [(R(1, A), R(1, A), R(1, A), R(1, A)) for _ in range(A)]
b = [(R(1, B), R(1, B), R(1, B), R(1, B), R(1, B)) for _ in range(B)]
_start = perf_counter()
d = {(g, f, z, n): i for g, f, z, n, i in b}  # build dictionary keyed by the 4-tuple
# iterate over list "a" looking for matches in the dictionary
for t in a:
    if (_id := d.get(t)):
        print(_id)
_end = perf_counter()
print(f'{_end-_start:.4f}')
# timing of the pandas solution
_start = perf_counter()
pd.DataFrame(a).merge(pd.DataFrame(b)).iloc[:, -1].drop_duplicates().tolist()
_end = perf_counter()
print(f'{_end-_start:.4f}')
Output:
0.3319
1.1606
...which, perhaps surprisingly, shows that pandas does not offer optimum performance for this use case.
In case it's relevant: Python 3.10.7, macOS 12.6, 3 GHz 10-Core Intel Xeon W, 32GB RAM, pandas 1.4.4
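One caveat with the dictionary lookup above: "if (_id := d.get(t))" treats a falsy id such as 0 as a miss. A variant (my sketch, not part of the timed code) that collects the ids into a list and sidesteps that pitfall:
# membership test instead of a truthiness check, so an id of 0 survives
matches = [d[t] for t in a if t in d]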
CodePudding user response:
If you don't mind using a pandas DataFrame, here is one solution using a pandas inner merge:
>>> import pandas as pd
>>> pd.DataFrame(a).merge(pd.DataFrame(b)).iloc[:,-1].drop_duplicates().tolist()
[116, 119]
And since it's pandas, it should be fast enough.
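If you want the merge keys to be explicit, you can name the columns first (the labels below are my assumption, taken from the comments in the question):
import pandas as pd
df_a = pd.DataFrame(a, columns=["g", "f", "z", "n"])
df_b = pd.DataFrame(b, columns=["g", "f", "z", "n", "id"])
# inner merge on the four shared columns, then keep the unique ids
ids = df_a.merge(df_b, on=["g", "f", "z", "n"])["id"].drop_duplicates().tolist()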
CodePudding user response:
You can try using heapq. Something along the lines of:
from heapq import heappop, heappush
c = []
while a:
    t = a.pop()
    heappush(c, (t[0], t[1], t[-1]))
while b:
    t = b.pop()
    heappush(c, (t[0], t[1], t[-1]))
matches = []
while c:
    try:
        t1 = heappop(c)
        t2 = heappop(c)
        # note: this sketch keys on the first two fields of each entry only
        if t1[0:2] == t2[0:2]:
            matches.append(t1[-1])
            matches.append(t2[-1])
        else:
            heappush(c, t2)  # t2 may still pair up with the next entry
    except IndexError:
        print("end of lists")
print(matches)
There are probably less “verbose” ways to use the module, read the doc to find out more about usage.
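For what it's worth, heapq.merge can make the idea more compact. The sketch below is my own variant, not the code above: it matches on the full 4-tuple, assumes a and b are still intact (the snippet above consumes them with pop()), and tags each entry with its source so that, for equal keys, the entry from list a sorts first and one linear scan can pair them up:
from heapq import merge

# tag 0 = from list a, tag 1 = from list b; for equal (g, f, z, n)
# keys the a-entry sorts first because 0 < 1
keyed_a = sorted((g, f, z, n, 0) for g, f, z, n in a)
keyed_b = sorted((g, f, z, n, 1, i) for g, f, z, n, i in b)
matches, seen = [], None
for t in merge(keyed_a, keyed_b):
    if t[4] == 0:
        seen = t[:4]              # this 4-tuple exists in list a
    elif t[:4] == seen:
        matches.append(t[5])      # id from list b whose key was just seen in a
print(matches)  # [119, 116] for the sample data (ordered by key)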