How to find similar elements from lists and merge them into one list in python?-CodePudding

I have a dictionary as follows:

{'similar_record_id': ['A-1','A-2','A-3','A-4','A-5'],

 'idx_pair' : [[25, 26],[835, 836],[834, 836, 835],[67, 69, 68],[62, 68, 66, 69, 67, 65]
  
}

Here idx_pair will have a list of lists, here some of pairs are similar to another list of elements as follows -

[835,836] is existed in this [834,836,835] so merge it as [834,835,836]

[67,69,68] is existed in this [62, 68, 66, 69, 67, 65] so merge it as [62,68,66,69,67,65]

If any of element from a list is existed in another list they should be merged.

expected output as:

{'similar_record_id': ['A-1','A-2','A-3'],

'idx_pair' : [[25, 26],[834,835, 836],[62, 68, 66, 69, 67, 65]
  
}

CodePudding user response：

One approach using graph theory, via networkx.connected_components:

import networkx as nx
from itertools import combinations

data = {'similar_record_id': ['A-1', 'A-2', 'A-3', 'A-4', 'A-5'],
        'idx_pair': [[25, 26], [835, 836], [834, 836, 835], [67, 69, 68], [62, 68, 66, 69, 67, 65]]}

# a dict to speed up search of idx_pairs by similar_record_id
lookup = {k: set(v) for k, v in zip(data["similar_record_id"], data["idx_pair"])}

# create a graph with nodes from the values of data["similar_record_id"]
g = nx.Graph()
g.add_nodes_from(data["similar_record_id"])

# if two idx_pair share an element add and edge to the graph
for s, t in combinations(lookup, r=2):
    if lookup[s] & lookup[t]:  # if the intersection is not empty
        g.add_edge(s, t)

# from the connected components create the output
res = {}
for component in nx.connected_components(g):
    representative = min(component, key=data['similar_record_id'].index)
    union = set().union(*(lookup[c] for c in component))
    res[representative] = union

print(res)

Output

{'A-1': {25, 26}, 'A-2': {834, 835, 836}, 'A-4': {65, 66, 67, 68, 69, 62}}

The idea is to use connected components to find the similar_record_id values that should be merged together.

CodePudding user response：

I would harness set arithemtics for this task as follows

data = [[25, 26],[835, 836],[834, 836, 835],[67, 69, 68],[62, 68, 66, 69, 67, 65]]
merged = []
for d in data:
    d_set = set(d)
    for m in merged:
        if d_set.intersection(m):
            m.extend(d)
            break
    else:
        merged.append(d)
print(merged)

gives output

[[25, 26], [835, 836, 834, 836, 835], [67, 69, 68, 62, 68, 66, 69, 67, 65]]

Explanation: I prepare empty list merged then iterate over data and extend first list in merged which have any common item with currenctly considered element of data (i.e. where intersection is not empty set), if there is not such element in merged I append data element to merge. Note usage of for-else where body of else is executed iff break was not used.