I have a task that involves matching data between two lists in Python. The lists look like this:
listA = [['a','b'],['c','d'],['a','f']]
listB = [['b','m'],['d','h'],['f','n'],['f','q']]
Now, I do the matching like this:
match_list = []
for x in listA:
    for y in listB:
        if x[1] == y[0]:
            match_list.append([x[0], y[1]])
match_list then looks like this: [['a','m'], ['c','h'], ['a','n'], ['a','q']]
The matching works, but it runs too slowly when there is a large amount of data in the lists. Is there a more efficient way to accomplish this task?
CodePudding user response:
You could convert listB into a dictionary (using defaultdict for convenience), with each first letter as a key whose value is a list of all the corresponding second letters in listB. So for your sample data:
from collections import defaultdict

dictB = defaultdict(list)
for k, v in listB:
    dictB[k].append(v)
dictB then has the value {'b': ['m'], 'd': ['h'], 'f': ['n', 'q']}.
You can now iterate over listA, adding to match_list whenever the second letter of each sublist is present in dictB:
match_list = []
for v, k in listA:
    match_list += [[v, v1] for v1 in dictB.get(k, [])]
Output:
[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
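Putting the two steps together, here is a minimal, self-contained sketch of this approach:

```python
from collections import defaultdict

listA = [['a', 'b'], ['c', 'd'], ['a', 'f']]
listB = [['b', 'm'], ['d', 'h'], ['f', 'n'], ['f', 'q']]

# Index listB by its first element; values are lists because keys can repeat.
dictB = defaultdict(list)
for k, v in listB:
    dictB[k].append(v)

# One pass over listA, with O(1) dict lookups instead of a nested loop.
match_list = [[v, v1] for v, k in listA for v1 in dictB.get(k, [])]
print(match_list)  # [['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
```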
CodePudding user response:
If your second list is very long and most pairs do not satisfy x[1] == y[0], you still pay for the full quadratic runtime. You could instead build a dictionary from listB that maps each key to the values it carries, and look up each listA entry in it:
listA = [['a','b'],['c','d'],['a','f']]
listB = [['b','m'],['d','h'],['f','n'],['f','q']]
# convert listB to a dictionary for faster lookup;
# values must be lists because keys can be duplicated
d = {}
for key, value in listB:
    d.setdefault(key, []).append(value)

result = []
for value, key in listA:
    for v in d.get(key, []):  # default [] skips unmatched keys, like the original
        result.append([value, v])
print(result)
This uses O(len(listB)) time to create the dict and then O(len(listA) * max(map(len, d.values()))), which may be considerably less than O(len(listA) * len(listB)).
Or, with some more values:
from string import ascii_letters

listC = []
for l in "bdf":
    for q in ascii_letters:
        listC.append([l, q])

d = {}
for key, value in listC:
    d.setdefault(key, []).append(value)

result = []
for value, key in listA:
    for v in d.get(key, []):
        result.append([value, v])
print(result)
which prints:
[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
[['a', 'a'], ['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['a', 'f'], ['a', 'g'], ['a', 'h'], ['a', 'i'], ['a', 'j'], ['a', 'k'], ['a', 'l'], ['a', 'm'], ['a', 'n'], ['a', 'o'], ['a', 'p'], ['a', 'q'], ['a', 'r'], ['a', 's'], ['a', 't'], ['a', 'u'], ['a', 'v'], ['a', 'w'], ['a', 'x'], ['a', 'y'], ['a', 'z'], ['a', 'A'], ['a', 'B'], ['a', 'C'], ['a', 'D'], ['a', 'E'], ['a', 'F'], ['a', 'G'], ['a', 'H'], ['a', 'I'], ['a', 'J'], ['a', 'K'], ['a', 'L'], ['a', 'M'], ['a', 'N'], ['a', 'O'], ['a', 'P'], ['a', 'Q'], ['a', 'R'], ['a', 'S'], ['a', 'T'], ['a', 'U'], ['a', 'V'], ['a', 'W'], ['a', 'X'], ['a', 'Y'], ['a', 'Z'], ['c', 'a'], ['c', 'b'], ['c', 'c'], ['c', 'd'], ['c', 'e'], ['c', 'f'], ['c', 'g'], ['c', 'h'], ['c', 'i'], ['c', 'j'], ['c', 'k'], ['c', 'l'], ['c', 'm'], ['c', 'n'], ['c', 'o'], ['c', 'p'], ['c', 'q'], ['c', 'r'], ['c', 's'], ['c', 't'], ['c', 'u'], ['c', 'v'], ['c', 'w'], ['c', 'x'], ['c', 'y'], ['c', 'z'], ['c', 'A'], ['c', 'B'], ['c', 'C'], ['c', 'D'], ['c', 'E'], ['c', 'F'], ['c', 'G'], ['c', 'H'], ['c', 'I'], ['c', 'J'], ['c', 'K'], ['c', 'L'], ['c', 'M'], ['c', 'N'], ['c', 'O'], ['c', 'P'], ['c', 'Q'], ['c', 'R'], ['c', 'S'], ['c', 'T'], ['c', 'U'], ['c', 'V'], ['c', 'W'], ['c', 'X'], ['c', 'Y'], ['c', 'Z'], ['a', 'a'], ['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['a', 'f'], ['a', 'g'], ['a', 'h'], ['a', 'i'], ['a', 'j'], ['a', 'k'], ['a', 'l'], ['a', 'm'], ['a', 'n'], ['a', 'o'], ['a', 'p'], ['a', 'q'], ['a', 'r'], ['a', 's'], ['a', 't'], ['a', 'u'], ['a', 'v'], ['a', 'w'], ['a', 'x'], ['a', 'y'], ['a', 'z'], ['a', 'A'], ['a', 'B'], ['a', 'C'], ['a', 'D'], ['a', 'E'], ['a', 'F'], ['a', 'G'], ['a', 'H'], ['a', 'I'], ['a', 'J'], ['a', 'K'], ['a', 'L'], ['a', 'M'], ['a', 'N'], ['a', 'O'], ['a', 'P'], ['a', 'Q'], ['a', 'R'], ['a', 'S'], ['a', 'T'], ['a', 'U'], ['a', 'V'], ['a', 'W'], ['a', 'X'], ['a', 'Y'], ['a', 'Z']]
You still need to create all those tiny lists, which takes time - so you may be better off changing how you approach the problem altogether - but we don't have enough info to help with that.
Look into generators or other data structures to get better runtimes - but you may want to timeit and analyse your bottlenecks first.
To make the dictionary creation faster, look into collections.defaultdict(list), as Nick's answer suggests.
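As a sketch of the generator idea mentioned above (assuming you only need to iterate over the matches once; the helper name iter_matches is made up for illustration):

```python
listA = [['a', 'b'], ['c', 'd'], ['a', 'f']]
listB = [['b', 'm'], ['d', 'h'], ['f', 'n'], ['f', 'q']]

def iter_matches(list_a, list_b):
    # Build the lookup dict once, then yield matches lazily
    # instead of materialising a full result list in memory.
    d = {}
    for key, value in list_b:
        d.setdefault(key, []).append(value)
    for value, key in list_a:
        for v in d.get(key, []):
            yield [value, v]

for pair in iter_matches(listA, listB):
    print(pair)
```

This avoids holding the whole result list if you only consume the matches one at a time; call list(iter_matches(...)) if you do need them all at once.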
CodePudding user response:
You could convert your lists to dictionaries, which are very fast to look up. I might have done it with a dict comprehension, which is a good optimization, but here you have a use case where some keys can occur several times. You then have the choice of nesting list and dict comprehensions, or using the functions explained here: Convert list to dictionary with duplicate keys using dict comprehension
You will need to install the iteration_utilities library.
from iteration_utilities import groupedby
from operator import itemgetter
list_a = [['a','b'],['c','d'],['a','f']]
list_b = [['b','m'],['d','h'],['f','n'],['f','q']]
dict_a = groupedby(list_a, key=itemgetter(1), keep=itemgetter(0))
dict_b = groupedby(list_b, key=itemgetter(0), keep=itemgetter(1))
match_list = []
for key in dict_a:
    if key in dict_b:
        for a_elt in dict_a[key]:
            for b_elt in dict_b[key]:
                match_list.append([a_elt, b_elt])
print(match_list)
Result will be
[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
Note that going through dicts groups the matches by key, so the output order can differ from the nested-loop version, which may not be compatible with your use case if several elements are associated with the same key.
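If you would rather not install an extra dependency, the same grouping can be done with a plain dict. A stdlib-only sketch (the helper name grouped is made up, standing in for iteration_utilities.groupedby):

```python
list_a = [['a', 'b'], ['c', 'd'], ['a', 'f']]
list_b = [['b', 'm'], ['d', 'h'], ['f', 'n'], ['f', 'q']]

def grouped(pairs, key_index, keep_index):
    # Group `pairs` into {key: [kept values]}, preserving duplicates.
    out = {}
    for pair in pairs:
        out.setdefault(pair[key_index], []).append(pair[keep_index])
    return out

dict_a = grouped(list_a, 1, 0)  # {'b': ['a'], 'd': ['c'], 'f': ['a']}
dict_b = grouped(list_b, 0, 1)  # {'b': ['m'], 'd': ['h'], 'f': ['n', 'q']}

match_list = [[a, b] for key in dict_a if key in dict_b
              for a in dict_a[key] for b in dict_b[key]]
print(match_list)  # [['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
```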
CodePudding user response:
A comparison of your approach and mine using timeit:
from string import ascii_letters
def quad():
    match_list = []
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0], y[1]])
    return match_list

def doIt():
    d = {}
    for key, value in listB:
        d.setdefault(key, []).append(value)
    result = []
    for value, key in listA:
        for v in d.get(key, []):
            result.append([value, v])
    return result
import timeit
listA = [ [a,b] for a in ascii_letters[:10] for b in ascii_letters]
listB = [ [a,b] for a in ascii_letters for b in ascii_letters]
print(timeit.timeit(quad, globals=globals(), number=100), "yours (quadratic)")
print(timeit.timeit(doIt, globals=globals(), number=100), "mine (using dict)")
q = quad()
d = doIt()
print("A B Results Results Identical")
print(len(listA), len(listB), len(quad()), len(doIt()), d==q)
Output:
14.989092700000583 yours (quadratic)
0.4518431000005876 mine (using dict)
A B Results Results Identical
520 2704 27040 27040 True
The main part of the time is spent creating your sublists - the one-time effort of creating the dict can be sped up using collections.defaultdict, but that is tiny compared to the remaining work of building the result list. Still: Nick's solution will be somewhat faster.
CodePudding user response:
There are many good answers about coding this better and more efficiently. I just want to highlight caching, since your loop keeps iterating over the same listB.
# Before caching
def loop_before_caching():
    match_list = []
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0], y[1]])
    return match_list
%timeit loop_before_caching()
3.52 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# After caching
from functools import lru_cache

@lru_cache(maxsize=512)
def loop_after_caching():
    match_list = []
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0], y[1]])
    return match_list
%timeit loop_after_caching()
90 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Try it with your real dataset and see if it helps in terms of speed. Also, remember to keep your data in global variables.
Extra
Just to give a better view of caching, I made listB larger by multiplying it by 100. Here's what I got:
%timeit loop_before_caching()
216 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit loop_after_caching()
84.3 ns ± 6.57 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
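One caveat worth noting with this approach: because loop_after_caching takes no arguments and reads the lists from globals, the cache cannot see when the data changes, so it keeps returning the old result until you explicitly clear it with cache_clear(). A small sketch of the pitfall:

```python
from functools import lru_cache

listA = [['a', 'b'], ['c', 'd'], ['a', 'f']]
listB = [['b', 'm'], ['d', 'h'], ['f', 'n'], ['f', 'q']]

@lru_cache(maxsize=512)
def loop_after_caching():
    match_list = []
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0], y[1]])
    return match_list

first = loop_after_caching()
listB = listB + [['b', 'z']]      # the data changed ...
stale = loop_after_caching()      # ... but the cached result is returned
loop_after_caching.cache_clear()  # invalidate the cache
fresh = loop_after_caching()      # recomputed; now includes ['a', 'z']

print(stale == first)             # True: stale cached result
print(len(fresh) - len(first))    # 1: the new match appears only after clearing
```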