Home > Software engineering >  merge 2 list of group by intersection
merge 2 list of group by intersection

Time:10-25

I have two list of groups, list element format is [name, group_id]:

lst1 = [
    ['apple', 1],
    ['banana', 1],
    ['orange', 1],
    ['123', 2],
    ['456', 2],
    ['abc', 3],
    ['ABC', 3],
    ['tony', 4],
    ['john', 4],
    ['jack', 4],
]

lst2 = [
    ['!@#', 1],
    ['apple', 2],
    ['banana', 2],
    ['strawberry', 2],
    ['lemon', 2],    
    ['john', 3],    
    ['tony', 3],
    ['adella', 3],
]

I want to merge 2 list by intersection of name, which means merge 2 group if they have the most value in common (Final group_id is not important). The result looks like:

lst = [
    ['apple', 1],
    ['banana', 1],
    ['orange', 1],
    ['strawberry', 1],
    ['lemon', 1],       
    ['!@#', 2],
    ['john', 3],    
    ['tony', 3],
    ['adella', 3],   
    ['jack', 3],    
    ['123', 4],
    ['456', 4],
    ['abc', 5],
    ['ABC', 5],   
]

How can I do this efficiently?

CodePudding user response:

Here is a working solution. It is not optimal (O(n**2)) as it needs to compare all elements of the first list to that of the second. I hope someone comes up with a better algorithm, but in the meantime:

from itertools import groupby

# group elements with common id and transform to set
def to_set(l):
    return [set(e[0] for e in g)
            for k,g in groupby(l, key=lambda x: x[1])]

# find first element of set_list that overlaps s1
def match_set(s1, set_list):
    for s2 in set_list:
        if len(s1.intersection(s2)) > 0:
            return s1.union(s2)
    return s1

sets1 = to_set(lst1)
sets2 = to_set(lst2)

# perform merge both ways (to have "outer join")
out = {tuple(sorted(match_set(s1, sets2))) for s1 in sets1}
out = out.union({tuple(sorted(match_set(s2, out))) for s2 in sets2})

# annotate with new group
out = [[v, i] for i,t in enumerate(out) for v in t]

output:

[['apple', 0],
 ['banana', 0],
 ['lemon', 0],
 ['orange', 0],
 ['strawberry', 0],
 ['!@#', 1],
 ['123', 2],
 ['456', 2],
 ['ABC', 3],
 ['abc', 3],
 ['adella', 4],
 ['jack', 4],
 ['john', 4],
 ['tony', 4]]
  • Related