I have two list of groups, list element format is [name, group_id]:
lst1 = [
['apple', 1],
['banana', 1],
['orange', 1],
['123', 2],
['456', 2],
['abc', 3],
['ABC', 3],
['tony', 4],
['john', 4],
['jack', 4],
]
lst2 = [
['!@#', 1],
['apple', 2],
['banana', 2],
['strawberry', 2],
['lemon', 2],
['john', 3],
['tony', 3],
['adella', 3],
]
I want to merge 2 list by intersection of name, which means merge 2 group if they have the most value in common (Final group_id is not important). The result looks like:
lst = [
['apple', 1],
['banana', 1],
['orange', 1],
['strawberry', 1],
['lemon', 1],
['!@#', 2],
['john', 3],
['tony', 3],
['adella', 3],
['jack', 3],
['123', 4],
['456', 4],
['abc', 5],
['ABC', 5],
]
How can I do this efficiently?
CodePudding user response:
Here is a working solution. It is not optimal (O(n**2)) as it needs to compare all elements of the first list to that of the second. I hope someone comes up with a better algorithm, but in the meantime:
from itertools import groupby
# group elements with common id and transform to set
def to_set(l):
return [set(e[0] for e in g)
for k,g in groupby(l, key=lambda x: x[1])]
# find first element of set_list that overlaps s1
def match_set(s1, set_list):
for s2 in set_list:
if len(s1.intersection(s2)) > 0:
return s1.union(s2)
return s1
sets1 = to_set(lst1)
sets2 = to_set(lst2)
# perform merge both ways (to have "outer join")
out = {tuple(sorted(match_set(s1, sets2))) for s1 in sets1}
out = out.union({tuple(sorted(match_set(s2, out))) for s2 in sets2})
# annotate with new group
out = [[v, i] for i,t in enumerate(out) for v in t]
output:
[['apple', 0],
['banana', 0],
['lemon', 0],
['orange', 0],
['strawberry', 0],
['!@#', 1],
['123', 2],
['456', 2],
['ABC', 3],
['abc', 3],
['adella', 4],
['jack', 4],
['john', 4],
['tony', 4]]