How to group all links in dictionary by using the first occurrence of a key as a starting point?-CodePudding

I have a dictionary like this,

matches = {'2-8-7 Yaesu, Chuo-Ku': ['Chuo Ward, Yaesu 2-8-7'],
           'Chuo Ward, Yaesu 2-8-7': ['2-8-7 Yaesu, Chuo-Ku'],
           'Fukuoka Bldg 10Th Floor': ['Fukuoka Building, 9Th -10Th Flr.'],
           'Fukuoka Bldg. 8-7 Yaesu Chome': ['2-8-7 Yaesu, Chuo-Ku', 
                                             'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku'],
           'Fukuoka Bldg. 9Th Fl': ['Fukuoka Building 9Th Floor'],
           'Fukuoka Building 9Th Floor': ['Fukuoka Bldg. 9Th Fl', 
                                          'Fukuoka Building, 9Th -10Th Flr.']}

I want to group them together by finding links (with keys or values), key can be anything (or) just the first key you come across that is the starting point.

This is the desired output I am looking forward to,

{'2-8-7 Yaesu, Chuo-Ku': ['Chuo Ward, Yaesu 2-8-7',
                          '2-8-7 Yaesu, Chuo-Ku',
                          'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku',
                          'Fukuoka Bldg. 8-7 Yaesu Chome'],
 'Fukuoka Bldg 10Th Floor': ['Fukuoka Building, 9Th -10Th Flr.',
                             'Fukuoka Bldg. 9Th Fl',
                             'Fukuoka Building 9Th Floor',
                             'Fukuoka Bldg. 9Th Fl']}

I have tried this,

unique_lst = set()
merged_matches = dict()
for key, values in matches.items():
    if key not in unique_lst:
        values_lst = []
        for v in values:
            output = matches.get(v)
            for subkeys, subvals in matches.items():
                if key != subkeys and v != subkeys:
                    keyvals = [subkeys]   list(subvals)
                    if v in keyvals:
                        values_lst.extend(keyvals)
            if output:
                values_lst.extend(output)
            values_lst.append(v)

        values_lst = [i for i in values_lst if i != key]
        values_lst = values_lst   [key]
        for v in values_lst:
            unique_lst.add(v)
            
        merged_matches[key] = values_lst

Here's the output I got,

# print(merged_matches)

{'Fukuoka Bldg. 9Th Fl': ['Fukuoka Building, 9Th -10Th Flr.',
                          'Fukuoka Building 9Th Floor',
                          'Fukuoka Bldg. 9Th Fl'],
 'Fukuoka Bldg. 8-7 Yaesu Chome': ['Chuo Ward, Yaesu 2-8-7',
                                   '2-8-7 Yaesu, Chuo-Ku',
                                   'Chuo Ward, Yaesu 2-8-7',
                                   '2-8-7 Yaesu, Chuo-Ku',
                                   'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku',
                                   'Fukuoka Bldg. 8-7 Yaesu Chome'],
 'Fukuoka Bldg 10Th Floor': ['Fukuoka Building 9Th Floor',
                             'Fukuoka Bldg. 9Th Fl',
                             'Fukuoka Building, 9Th -10Th Flr.',
                             'Fukuoka Building, 9Th -10Th Flr.',
                             'Fukuoka Bldg 10Th Floor']}

CodePudding user response：

IMO, the problem boils down to finding the connected components of a graph induced by the dictionary. One way you could do so is using the UnionFind datastructure to get the list of disjoint sets constructed from the keys and values.

Then we could construct a dictionary from the merged sets by selecting one element as key and the remainder as values.

from networkx.utils.union_find import UnionFind
c = UnionFind()
for k, lst in matches.items():
    c.union(*[k, *lst])

out = {k: v for k, *v in map(list, c.to_sets())}

Output:

{'Chuo Ward, Yaesu 2-8-7': ['2-8-7 Yaesu, Chuo-Ku',
  'Fukuoka Bldg. 8-7 Yaesu Chome',
  'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku'],
 'Fukuoka Building 9Th Floor': ['Fukuoka Bldg 10Th Floor',
  'Fukuoka Bldg. 9Th Fl',
  'Fukuoka Building, 9Th -10Th Flr.']}

CodePudding user response：

Trying to be creative with networkx, a network graph package. The idea is that your question can be translated into "classifying addresses into groups, with no linkage between groups."

So networkx sounds like a handy solution here.

Understanding the problem,

Denote the following:
1 '2-8-7 Yaesu, Chuo-Ku',
2 'Chuo Ward, Yaesu 2-8-7',
3 'Fukuoka Bldg 10Th Floor',
4 'Fukuoka Bldg. 8-7 Yaesu Chome',
5 'Fukuoka Bldg. 9Th Fl',
6 'Fukuoka Building 9Th Floor',
7 'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku',
8 'Fukuoka Building, 9Th -10Th Flr.'
---------
Reading from your original list of dictionaries, the addresses have the following relationships:

1 -> 2
2 -> 1
3 -> 8
4 -> 1
4 -> 7
5 -> 6
6 -> 5
6 -> 8

The final output is two dictionaries:
(1, 2, 7, 4) and (3, 5, 8, 6).

(1, 2, 7, 4) and (3, 5, 8, 6) is what we call, connected components. Hence the following implementation:

matches = {'2-8-7 Yaesu, Chuo-Ku': ['Chuo Ward, Yaesu 2-8-7'],
 'Chuo Ward, Yaesu 2-8-7': ['2-8-7 Yaesu, Chuo-Ku'],
 'Fukuoka Bldg 10Th Floor': ['Fukuoka Building, 9Th -10Th Flr.'],
 'Fukuoka Bldg. 8-7 Yaesu Chome': ['2-8-7 Yaesu, Chuo-Ku',
                                   'Fukuoka Building, 8-7, Yaesu 2 Chome, Chuo-Ku'],
 'Fukuoka Bldg. 9Th Fl': ['Fukuoka Building 9Th Floor'],
 'Fukuoka Building 9Th Floor': ['Fukuoka Bldg. 9Th Fl',
                                'Fukuoka Building, 9Th -10Th Flr.']}

import networkx as nx

G = nx.Graph()
for k,v in matches.items():
    G.add_node(k)
    for i in v:
        G.add_node(i)
        G.add_edge(k, i)

groups = list(nx.connected_components(G))

The final result groups is a list of set, where each set is an isolated group.