Sum values of dictionary if any character within key strings match-CodePudding

Imagine I have a dictionary with string keys and integer values, like:

{'255': 8,
 '323': 1,
 '438': 1,
 '938': 2,
 '166': 1,
 '117': 10,
 '777': 2
}

I would like to create a new dictionary that contains the following information:

keys = some new unique string (it can be anything)
values = the sum of values from the above dictionary across all keys in the above dictionary that match any character within any of the matched keys.

In the above example, the result would look something like:

{
 'group1' : 12, # Sum of key values for '255' (8), '323' (1), '438' (1), '938' (2)
 'group2' : 13  # Sum of key values for '166' (1), '117' (10), '777' (2)
}

To clarify, group1 is created in the following way: 255 has a 2 which matches the 2 in 323. Then the 3's in 323 also match the 3 in 438. Finally, the 8 (or the 3) in 438 match the 8 (or 3) in 938. So adding the values for these keys in the same order we get 8 1 1 2 = 12.

None of the above key values (group1) are merged with the remaining key values (for keys 166, 117, 777; making up group2) because none of the characters from the group1 keys match any of the characters in the group2 keys.

Group2 is created in the same way by matching the 1 from 166 to the 1s in 117 and matching the 7 in 117 to the 7s in 777.

Final notes:

Only a single character needs to match at least one other single character between keys to be included in a "group"
The order in which I just explained the above summation of values shouldn't matter, the result should work beginning from any pair within that group
Key characters will always be digits
Reminder that the keys of the output dictionary can be anything. I used group1 and group2 for convenience

What is an efficient way to accomplish this?

I thought about converting the key values into a numpy matrix where the rows represent a single key and the columns represent the values of each character. However, going down this path gets fairly complicated quickly and I'd rather avoid it if there is a better way.

CodePudding user response：

You can use a union-find data structure to solve this problem. The networkx package provides an implementation, but there's nothing stopping you from writing your own.

In essence, we maintain a collection of disjoint sets. Initially, every string belongs to its own disjoint set. For each pair of strings, if they have letters in common, we union the disjoint sets they belong to. This eventually gives us the groups that we're looking for.

From here, we use the .to_sets() method to get the groupings, and compute the desired sum:

from networkx.utils.union_find import UnionFind

data = # dictionary in question, omitted for brevity
keys = list(data.keys())

uf = UnionFind(data.keys())

for outer_idx in range(len(keys)):
    for inner_idx in range(outer_idx   1, len(keys)):
        if set(keys[outer_idx]) & set(keys[inner_idx]):
            uf.union(keys[outer_idx], keys[inner_idx])

result = {}
for idx, group in enumerate(uf.to_sets()):
    result[idx] = sum(data[key] for key in group)

print(result)

This outputs:

{0: 12, 1: 13}

CodePudding user response：

Try:

d = {"255": 8, "323": 1, "438": 1, "938": 2, "166": 1, "117": 10, "777": 2}

keys, groups = list(d), {}
while keys:
    current = keys.pop()
    s_current = set(current)
    for k in groups:
        if s_current.intersection(k):
            groups[k].append(d[current])
            groups[current] = groups[k]
            break
    else:
        groups[current] = [d[current]]

out = {}
for k, v in groups.items():
    if id(v) not in out:
        out[id(v)] = sum(v)


print(out)

Prints:

{140318922050368: 13, 140318911221184: 12}

A little explanation:

We create a temporary dict groups where the keys are from original dictionary d and values are lists of values from the original dictionary which share a character with the key. Note: the lists are shared between the keys - total number of these lists are equal to total number of distinct groups.

CodePudding user response：

This is similar to @BrokenBenchmark's answer, but it is self-contained, and it also groups things by digit first, so there's no O(n^2) doubly-nested loop over all the strings.

from collections import defaultdict

def get_edges(strings):
    # group together common digits.
    by_digit = [[] for _ in range(10)]
    for i, s in enumerate(strings):
        for digit in map(int, s):
            by_digit[digit].append(i)

    # attach the first of each group to the rest.
    for digit_locations in by_digit:
        if digit_locations:
            x = digit_locations[0]
            for y in digit_locations[1:]:
                yield (x, y)

def union_find_groups(edges, n):
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # find (path-splitting)
        while parent[x] != x:
            x, parent[x] = parent[x], parent[parent[x]]
        return x

    def union(x, y):
        x = find(x)
        y = find(y)
        if x == y:
            return
        # Union (by size)
        if size[x] < size[y]:
            x, y = y, x
        parent[y] = x
        size[x]  = size[y]

    for x, y in edges:
        union(x, y)
    
    result = defaultdict(list)
    for x in range(n):
        result[find(x)].append(x)

    return result.values()

def main(input_data):
    strings = list(input_data.keys())
    index_groups = union_find_groups(get_edges(strings), len(strings))
    string_groups = [[strings[i] for i in group] for group in index_groups]
    print(string_groups)
    results = [sum(input_data[s] for s in group) for group in string_groups]
    return {f"group{i}": total for i, total in enumerate(results)}

if __name__ == "__main__":
    result = main({
        '255': 8,
        '323': 1,
        '438': 1,
        '938': 2,
        '166': 1,
        '117': 10,
        '777': 2
    })
    print(result)

Result:

[['255', '323', '438', '938'], ['166', '117', '777']]
{'group0': 12, 'group1': 13}