data matching between two lists with python

I have a task that involves matching data between two lists in Python. The lists look like this:

listA = [['a','b'],['c','d'],['a','f']]

listB = [['b','m'],['d','h'],['f','n'],['f','q']]

Currently, I do the matching like this:

match_list=[]
for x in listA:
    for y in listB:
        if x[1] == y[0]:
            match_list.append([x[0],y[1]])

match_list will then look like this: [['a','m'], ['c','h'], ['a','n'], ['a','q']]

Although this produces the correct matches, it runs too slowly when there is a large amount of data in the lists. Is there a faster way to accomplish this task?

CodePudding user response:

You could convert listB into a dictionary (using defaultdict for convenience), mapping each first letter to a list of all the corresponding second letters in listB. For your sample data:

from collections import defaultdict
dictB = defaultdict(list)
for [k, v] in listB:
    dictB[k].append(v)

dictB then has the value {'b': ['m'], 'd': ['h'], 'f': ['n', 'q']}.

You can now iterate over listA, adding to match_list if the second letter in each sublist is present in dictB:

match_list = []
for [v, k] in listA:
    match_list += [[v, v1] for v1 in dictB.get(k, [])]

Output:

[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]
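
As a side note, the same loop collapses to a single list comprehension over dictB (a minimal sketch, equivalent to the loop above):

match_list = [[v, v1] for v, k in listA for v1 in dictB.get(k, [])]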

CodePudding user response:

If your second list is very long and most pairs fail the x[1] == y[0] test, you still pay for quadratic runtime.

You could instead create a dictionary that records which keys exist in listB and which values they map to, and use that for the lookup:

listA = [['a','b'],['c','d'],['a','f']]
listB = [['b','m'],['d','h'],['f','n'],['f','q']]


# convert listB to a dictionary for faster lookup;
# values must be lists because keys can be duplicated

d = {}
for key, value in listB:
    d.setdefault(key, []).append(value)

result = []

for value, key in listA:
    for v in d.get(key, []):   # [] default: keys with no match produce nothing
        result.append([value, v])

print(result)

This takes O(len(listB)) to create the dict and then O(len(listA) * m), where m is the largest number of values stored under one key, which may be considerably less than O(len(listA) * len(listB)).

Or, with some more values:

from string import ascii_letters

listC = []
for l in "bdf":
    for q in ascii_letters:
        listC.append([l,q])

d = {}
for key, value in listC:
    d.setdefault(key, []).append(value)

result = []

for value, key in listA:
    for v in d.get(key, []):
        result.append([value, v])

print(result)

which prints:

[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]

[['a', 'a'], ['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['a', 'f'], ['a', 'g'], ['a', 'h'], ['a', 'i'], ['a', 'j'], ['a', 'k'], ['a', 'l'], ['a', 'm'], ['a', 'n'], ['a', 'o'], ['a', 'p'], ['a', 'q'], ['a', 'r'], ['a', 's'], ['a', 't'], ['a', 'u'], ['a', 'v'], ['a', 'w'], ['a', 'x'], ['a', 'y'], ['a', 'z'], ['a', 'A'], ['a', 'B'], ['a', 'C'], ['a', 'D'], ['a', 'E'], ['a', 'F'], ['a', 'G'], ['a', 'H'], ['a', 'I'], ['a', 'J'], ['a', 'K'], ['a', 'L'], ['a', 'M'], ['a', 'N'], ['a', 'O'], ['a', 'P'], ['a', 'Q'], ['a', 'R'], ['a', 'S'], ['a', 'T'], ['a', 'U'], ['a', 'V'], ['a', 'W'], ['a', 'X'], ['a', 'Y'], ['a', 'Z'], ['c', 'a'], ['c', 'b'], ['c', 'c'], ['c', 'd'], ['c', 'e'], ['c', 'f'], ['c', 'g'], ['c', 'h'], ['c', 'i'], ['c', 'j'], ['c', 'k'], ['c', 'l'], ['c', 'm'], ['c', 'n'], ['c', 'o'], ['c', 'p'], ['c', 'q'], ['c', 'r'], ['c', 's'], ['c', 't'], ['c', 'u'], ['c', 'v'], ['c', 'w'], ['c', 'x'], ['c', 'y'], ['c', 'z'], ['c', 'A'], ['c', 'B'], ['c', 'C'], ['c', 'D'], ['c', 'E'], ['c', 'F'], ['c', 'G'], ['c', 'H'], ['c', 'I'], ['c', 'J'], ['c', 'K'], ['c', 'L'], ['c', 'M'], ['c', 'N'], ['c', 'O'], ['c', 'P'], ['c', 'Q'], ['c', 'R'], ['c', 'S'], ['c', 'T'], ['c', 'U'], ['c', 'V'], ['c', 'W'], ['c', 'X'], ['c', 'Y'], ['c', 'Z'], ['a', 'a'], ['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['a', 'f'], ['a', 'g'], ['a', 'h'], ['a', 'i'], ['a', 'j'], ['a', 'k'], ['a', 'l'], ['a', 'm'], ['a', 'n'], ['a', 'o'], ['a', 'p'], ['a', 'q'], ['a', 'r'], ['a', 's'], ['a', 't'], ['a', 'u'], ['a', 'v'], ['a', 'w'], ['a', 'x'], ['a', 'y'], ['a', 'z'], ['a', 'A'], ['a', 'B'], ['a', 'C'], ['a', 'D'], ['a', 'E'], ['a', 'F'], ['a', 'G'], ['a', 'H'], ['a', 'I'], ['a', 'J'], ['a', 'K'], ['a', 'L'], ['a', 'M'], ['a', 'N'], ['a', 'O'], ['a', 'P'], ['a', 'Q'], ['a', 'R'], ['a', 'S'], ['a', 'T'], ['a', 'U'], ['a', 'V'], ['a', 'W'], ['a', 'X'], ['a', 'Y'], ['a', 'Z']]

You still need to create all those tiny lists, which takes time, so you may be better off rethinking how you approach the problem entirely, but we don't have enough information to help with that.

Look into generators or other data structures to get better runtimes (a sketch follows below), but you may want to timeit and analyse your bottlenecks first.
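
For instance, a minimal generator-based sketch (reusing listA and the d dict built above), which yields matches lazily instead of materialising the whole result list:

def iter_matches(pairs, d):
    # yield one [value, v] match at a time instead of building a full list
    for value, key in pairs:
        for v in d.get(key, []):
            yield [value, v]

# consume on demand, e.g. stop after the first match:
first = next(iter_matches(listA, d), None)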

To make the dictionary creation faster, look into collections.defaultdict(list), as Nick's answer suggests; see the sketch below.
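
A minimal sketch of that swap (same behaviour as the setdefault loop above):

from collections import defaultdict

d = defaultdict(list)
for key, value in listB:
    d[key].append(value)   # missing keys get a fresh list automatically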

CodePudding user response:

You could convert your lists into dictionaries, which are very fast to look up.

I would normally do this with a dict comprehension, which is a good optimization, but here you have a use case where the same key can occur several times. You then have the choice of nesting list and dict comprehensions, or using the functions explained here: Convert list to dictionary with duplicate keys using dict comprehension

You will need to install the iteration_utilities library.


from iteration_utilities import groupedby
from operator import itemgetter

list_a = [['a','b'],['c','d'],['a','f']]
list_b = [['b','m'],['d','h'],['f','n'],['f','q']]

dict_a = groupedby(list_a, key=itemgetter(1), keep=itemgetter(0))
dict_b = groupedby(list_b, key=itemgetter(0), keep=itemgetter(1))

match_list=[]

for key in dict_a:
    if key in dict_b:
        for a_elt in dict_a[key]:
            for b_elt in dict_b[key]:
                match_list.append([a_elt, b_elt])

print(match_list)

Result will be

[['a', 'm'], ['c', 'h'], ['a', 'n'], ['a', 'q']]

Note that a plain dict conversion (e.g. a straightforward dict comprehension) would have the bad side effect of silently dropping values when several elements share the same key, which may not be compatible with your use case; groupedby avoids this by collecting all values per key, as the sketch below illustrates.
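
For example (a minimal illustration with the sample data above, not from the original answer):

# a naive dict comprehension keeps only the last value for a duplicated key
naive_b = {key: value for key, value in list_b}
print(naive_b)   # {'b': 'm', 'd': 'h', 'f': 'q'} -- the 'n' for 'f' is lost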

CodePudding user response:

A comparison of your approach and mine using timeit:

from string import ascii_letters

def quad():
    match_list=[]
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0],y[1]])
    return match_list

def doIt():
    d = {}
    for key, value in listB:
        d.setdefault(key, []).append(value)

    result = []

    for value, key in listA:
        for v in d.get(key, []):
            result.append([value, v])
    return result

import timeit

listA = [ [a,b] for a in ascii_letters[:10] for b in ascii_letters] 
listB = [ [a,b] for a in ascii_letters for b in ascii_letters] 

print(timeit.timeit(quad, globals=globals(), number=100), "yours (quadratic)")
print(timeit.timeit(doIt, globals=globals(), number=100), "mine  (using dict)")

q = quad()
d = doIt()

print("A   B    Results    Results Identical")
print(len(listA), len(listB), len(quad()), len(doIt()), d==q)

Output:

14.989092700000583 yours (quadratic)
0.4518431000005876 mine  (using dict)

lenA  lenB  len(quad)  len(doIt)  identical
520 2704 27040 27040 True

Most of the time is spent creating your sublists; the one-time effort of creating the dict can be sped up using collections.defaultdict, but that is tiny compared to the remaining work of building the result list. Still: Nick's solution will be somewhat faster.

CodePudding user response:

There are many good answers here about coding this better and more efficiently. I just want to bring up caching, since your loop iterates over the same listB every time.

#Before caching
def loop_before_caching():
    match_list=[]
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0],y[1]])
    return match_list

%timeit loop_before_caching()
3.52 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#After caching
from functools import lru_cache

@lru_cache(maxsize = 512)
def loop_after_caching():
    match_list=[]
    for x in listA:
        for y in listB:
            if x[1] == y[0]:
                match_list.append([x[0],y[1]])
    return match_list

%timeit loop_after_caching()
90 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Try it with your real dataset and see if it helps in terms of speed. Also, remember to keep your data in global variables: the cached function takes no arguments, so lru_cache simply returns the stored result of the first call, which is only valid if the lists never change between calls.

Extra

Just to give a better view of caching, I updated listB by multiplying it by 100. Here's what I got:

%timeit loop_before_caching()
216 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit loop_after_caching()
84.3 ns ± 6.57 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
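
As an aside (an assumption on my part, not from the original answer): if the data does change between calls, the zero-argument cache above would return stale results. One way to keep lru_cache correct is to pass the data as hashable arguments (tuples of tuples), so the cache keys on the actual inputs:

from functools import lru_cache

@lru_cache(maxsize=512)
def match(list_a, list_b):
    # arguments must be hashable, hence tuples of tuples rather than lists
    match_list = []
    for x in list_a:
        for y in list_b:
            if x[1] == y[0]:
                match_list.append([x[0], y[1]])
    return match_list

result = match(tuple(map(tuple, listA)), tuple(map(tuple, listB)))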