How to group variables with the same value

Hi, I want to write efficient code. Can I get some help?

For example, given the four dictionary variables below:

v1 = {'title':'title1', 'number1':1, 'number2':2, 'number3':3, 'number4':4, 'number5':5}
v2 = {'title':'title2', 'number1':1, 'number2':2, 'number3':3, 'number4':4, 'number5':55}
v3 = {'title':'title3', 'number1':1, 'number2':2, 'number3':33, 'number4':4, 'number5':55}

v4 = {'title':'title4', 'number1':1, 'number2':4567, 'number3':8910, 'number4':5177, 'number5':1511}

I want to compare the values of the keys 'number1', 'number2', 'number3', 'number4', and 'number5', and group two variables together when 3 or more of those values are the same.

Expected result:

[['title1', 'title2', 'title3'], ['title4']]

The output doesn't have to match the expected result exactly; the variables just need to be grouped.

Can you help me? Thank you.

CodePudding user response：

Here's an approach where groups are iteratively combined until there are no remaining ways to combine them:

vars = [
    {'title':'title1', 'number1':1, 'number2':2, 'number3':3, 'number4':4, 'number5':5},
    {'title':'title2', 'number1':1, 'number2':2, 'number3':3, 'number4':44, 'number5':55},
    {'title':'title3', 'number1':11, 'number2':22, 'number3':3, 'number4':4, 'number5':5},
    {'title':'title4', 'number1':1, 'number2':4567, 'number3':8910, 'number4':5177, 'number5':1511},
]
groups = [[var] for var in vars]

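while True:
    for group in groups:
        for other in groups:
            # Compare identity, not equality, so that two distinct groups
            # with equal contents can still be merged.
            if group is other:
                continue
            # Merge when any pair of members agrees on at least 3 "number" values.
            if any(sum(
                v == var2[k]
                for k, v in var1.items()
                if k.startswith("number")
            ) >= 3 for var1 in group for var2 in other):
                group.extend(other)
                groups.remove(other)
                break
        else:
            # Nothing merged into this group; try the next one.
            continue
        # A merge changed the list, so restart the scan.
        break
    else:
        # A full scan found nothing to merge, so the grouping is complete.
        break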

print([[var['title'] for var in group] for group in groups])

prints:

[['title1', 'title2', 'title3'], ['title4']]
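
For larger inputs, the same transitive grouping can probably be computed with one pass over all pairs instead of restarting the scan after every merge. The following is a minimal sketch that uses a disjoint-set (union-find) structure, which is a different technique from the loop above. The function name `group_titles` and the `threshold` parameter are assumptions made for illustration.

```python
from itertools import combinations


def group_titles(records, threshold=3):
    """Group records transitively: two records are linked when they agree
    on at least `threshold` of their 'number*' keys."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        matches = sum(
            v == b.get(k) for k, v in a.items() if k.startswith("number")
        )
        if matches >= threshold:
            # Union: attach j's root to i's root.
            parent[find(j)] = find(i)

    # Collect titles by root, keeping the input order inside each group.
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec["title"])
    return list(groups.values())


records = [
    {'title': 'title1', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 4, 'number5': 5},
    {'title': 'title2', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 44, 'number5': 55},
    {'title': 'title3', 'number1': 11, 'number2': 22, 'number3': 3, 'number4': 4, 'number5': 5},
    {'title': 'title4', 'number1': 1, 'number2': 4567, 'number3': 8910, 'number4': 5177, 'number5': 1511},
]
print(group_titles(records))
# [['title1', 'title2', 'title3'], ['title4']]
```

Every pair is still compared once, so the comparison work remains quadratic in the number of records; the saving is only in the merge bookkeeping.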

CodePudding user response：

For n variables, you can build an n×n matrix that represents the connectedness of every pair, so that connected(n1, n2) is 1 if variables n1 and n2 share at least 3 values and 0 otherwise (each variable counts as connected to itself, so the diagonal is 1). Then take a vector of n elements that is all zeros except for a 1 at the position of a given variable, and multiply it by the connectedness matrix n times; the non-zero entries of the resulting vector mark the members of that variable's group. After you've done this for one variable, repeat for the remaining variables that have not yet been included in a group.
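
The matrix idea above could be sketched roughly as follows, in plain Python without any matrix library. This is only one possible reading of the description: the names `group_by_matrix`, `shared_count`, and `threshold` are made up for illustration, the diagonal is set to 1 so that a variable stays in its own group, and the vector is clipped to 0/1 after each multiplication to keep the numbers small.

```python
def shared_count(a, b, keys):
    # Number of keys on which the two records hold the same value.
    return sum(a[k] == b[k] for k in keys)


def group_by_matrix(records, keys, threshold=3):
    n = len(records)
    # connected[i][j] is 1 when records i and j agree on at least
    # `threshold` keys; the diagonal is always 1.
    connected = [
        [
            1 if i == j or shared_count(records[i], records[j], keys) >= threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]

    groups, assigned = [], set()
    for start in range(n):
        if start in assigned:
            continue
        # Indicator vector with a single 1 at the starting record.
        vec = [0] * n
        vec[start] = 1
        # n matrix-vector products are enough to reach every member.
        for _ in range(n):
            vec = [
                min(1, sum(connected[i][j] * vec[j] for j in range(n)))
                for i in range(n)
            ]
        members = [i for i in range(n) if vec[i]]
        assigned.update(members)
        groups.append([records[i]["title"] for i in members])
    return groups


records = [
    {'title': 'title1', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 4, 'number5': 5},
    {'title': 'title2', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 4, 'number5': 55},
    {'title': 'title3', 'number1': 1, 'number2': 2, 'number3': 33, 'number4': 4, 'number5': 55},
    {'title': 'title4', 'number1': 1, 'number2': 4567, 'number3': 8910, 'number4': 5177, 'number5': 1511},
]
num_keys = [f"number{i}" for i in range(1, 6)]
print(group_by_matrix(records, num_keys))
# [['title1', 'title2', 'title3'], ['title4']]
```

Without the 1s on the diagonal, repeated multiplication tracks walks of an exact length rather than reachability, so the starting variable can drop out of its own group.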

CodePudding user response：

Here is another way to frame this problem using graphs.

Define a graph with nodes corresponding to titles and an edge drawn between two titles if they agree on at least 3 out of the 5 number attributes. You can find the connected components using networkx as follows:

import networkx as nx
from itertools import combinations

vs = [v1, v2, v3, v4]
num_cols = [f"number{i}" for i in range(1, 6)]

g = nx.Graph()

# Add every title as a node so that unconnected records still form a group.
g.add_nodes_from([v["title"] for v in vs])
# Use a and b as loop names so the original v1 and v2 are not overwritten.
for a, b in combinations(vs, 2):
    if sum(a[k] == b[k] for k in num_cols) >= 3:
        g.add_edge(a["title"], b["title"])

groups = [list(c) for c in nx.connected_components(g)]
# [['title1', 'title3', 'title2'], ['title4']]  (order within a group may vary)

Alternatively, you may want to find groups such that any two elements in the group agree on at least three number attributes. In this case, you will be looking for cliques in the graph:

cliques = [list(c) for c in nx.find_cliques(g)]
# [['title1', 'title3', 'title2'], ['title4']]

The two approaches result in the same output for this example, but not in general. In the second (clique) approach, an element may appear in multiple groups.
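
To see where the two approaches diverge, here is a small sketch that reuses the data from the first answer in this thread, where title1 agrees with title2 and with title3 on three values, but title2 and title3 agree on only one. The variable names `records`, `components`, and `cliques` are chosen for illustration, and the results are sorted only to make the printed order predictable.

```python
from itertools import combinations

import networkx as nx

records = [
    {'title': 'title1', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 4, 'number5': 5},
    {'title': 'title2', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 44, 'number5': 55},
    {'title': 'title3', 'number1': 11, 'number2': 22, 'number3': 3, 'number4': 4, 'number5': 5},
    {'title': 'title4', 'number1': 1, 'number2': 4567, 'number3': 8910, 'number4': 5177, 'number5': 1511},
]
num_cols = [f"number{i}" for i in range(1, 6)]

g = nx.Graph()
g.add_nodes_from(r["title"] for r in records)
for a, b in combinations(records, 2):
    if sum(a[k] == b[k] for k in num_cols) >= 3:
        g.add_edge(a["title"], b["title"])

# Connected components: title2 and title3 are linked through title1.
components = sorted(sorted(c) for c in nx.connected_components(g))
# Maximal cliques: every pair inside a group must be directly linked.
cliques = sorted(sorted(c) for c in nx.find_cliques(g))

print(components)  # [['title1', 'title2', 'title3'], ['title4']]
print(cliques)     # [['title1', 'title2'], ['title1', 'title3'], ['title4']]
```

Here title1 appears in two cliques, which illustrates the point about an element belonging to multiple groups.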