I have a dictionary as follows:
{'similar_record_id': ['A-1','A-2','A-3','A-4','A-5'],
'idx_pair' : [[25, 26],[835, 836],[834, 836, 835],[67, 69, 68],[62, 68, 66, 69, 67, 65]
}
Here idx_pair will have a list of lists, here some of pairs are similar to another list of elements as follows -
[835,836] is existed in this [834,836,835] so merge it as [834,835,836]
[67,69,68] is existed in this [62, 68, 66, 69, 67, 65] so merge it as [62,68,66,69,67,65]
If any of element from a list is existed in another list they should be merged.
expected output as:
{'similar_record_id': ['A-1','A-2','A-3'],
'idx_pair' : [[25, 26],[834,835, 836],[62, 68, 66, 69, 67, 65]
}
CodePudding user response:
One approach using graph theory, via networkx.connected_components
:
import networkx as nx
from itertools import combinations
data = {'similar_record_id': ['A-1', 'A-2', 'A-3', 'A-4', 'A-5'],
'idx_pair': [[25, 26], [835, 836], [834, 836, 835], [67, 69, 68], [62, 68, 66, 69, 67, 65]]}
# a dict to speed up search of idx_pairs by similar_record_id
lookup = {k: set(v) for k, v in zip(data["similar_record_id"], data["idx_pair"])}
# create a graph with nodes from the values of data["similar_record_id"]
g = nx.Graph()
g.add_nodes_from(data["similar_record_id"])
# if two idx_pair share an element add and edge to the graph
for s, t in combinations(lookup, r=2):
if lookup[s] & lookup[t]: # if the intersection is not empty
g.add_edge(s, t)
# from the connected components create the output
res = {}
for component in nx.connected_components(g):
representative = min(component, key=data['similar_record_id'].index)
union = set().union(*(lookup[c] for c in component))
res[representative] = union
print(res)
Output
{'A-1': {25, 26}, 'A-2': {834, 835, 836}, 'A-4': {65, 66, 67, 68, 69, 62}}
The idea is to use connected components to find the similar_record_id
values that should be merged together.
CodePudding user response:
I would harness set arithemtics for this task as follows
data = [[25, 26],[835, 836],[834, 836, 835],[67, 69, 68],[62, 68, 66, 69, 67, 65]]
merged = []
for d in data:
d_set = set(d)
for m in merged:
if d_set.intersection(m):
m.extend(d)
break
else:
merged.append(d)
print(merged)
gives output
[[25, 26], [835, 836, 834, 836, 835], [67, 69, 68, 62, 68, 66, 69, 67, 65]]
Explanation: I prepare empty list merged
then iterate over data and extend first list in merged which have any common item with currenctly considered element of data (i.e. where intersection is not empty set), if there is not such element in merged I append data element to merge. Note usage of for-else
where body of else
is executed iff break
was not used.