Home > Blockchain >  Find differences in list of strings
Find differences in list of strings

Time:11-10

input list:

['A.B.C.D','A.A.C.D','A.B.E.F','A.B.E.GG']

The strings are variables concatenated with a dot. So in first string, variables are A B C D, but in last string variable are A B E GG (variables can be of varying length), but the strings will always have the same number of variables separated by the dot. I would like to group those strings together which have only one different variable. So above would produce 2 groups.

['A.B.C.D','A.A.C.D'] ['A.B.E.F','A.B.E.GG']

The the difference has to be in the same variable. not one difference across all variables. and each string should be included only once, not multiple times across groups.

I wrote an algorithm which does it but is too complicated. I also tried via itertools groupby but could not get it to produce the correct results:

import itertools
import difflib

class Grouper:
    def __init__(self, diff):
        self.last = None
        self.diff = diff
    def get_diffs(self, curr_key):
        if self.last == None:
            return []
        dims = curr_key.split('.')
        previous_dims = self.last.split('.')
        diffs = list((dm for dm in difflib.ndiff(dims, previous_dims) if '-' not in dm and '?' not in dm))
        return [n for n, d in enumerate(diffs) if ' ' in d]
    
    def __call__(self, item):
        result = self.get_diffs(item)
        print(item)
        self.last = item
        return result


grp = Grouper(data)
groups = []
uniquekeys = []

for k, g in itertools.groupby(data, grp):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)
    

CodePudding user response:

I added another string "A.B.C.F", this string can be matched with "A.B.C.D" and "A.B.E.F":

import itertools


def main(l: list) -> list:
    splits = [tuple(s.split(".")) for s in l]

    groups = {}
    # Dict of {(tuple, index of difference): list of tuples that match the key}

    for split in splits:
        matched = False
        for (t, i) in groups.keys():
            # We match two splits if they only have one difference
            diffs = [i for i in range(len(split)) if split[i] != t[i]]
            if len(diffs) == 1:
                # The strings match but is the first match of 't'?
                if i is None:
                    groups.pop((t, i))
                    groups[(t, diffs[0])] = [split]
                    # We found a match, stop the matching of this tuple
                    matched = True
                    break
                elif diffs[0] == i:
                    groups[(t, i)].append(split)
                    matched = True
                    break
        if not matched:
            # Let's add this split as a candidate for future splits
            groups[(split, None)] = []

    return [[".".join(k)]   [".".join(s) for s in v] for (k, i), v in groups.items()]


print(main(["A.B.C.D", "A.A.C.D", "A.B.E.F", "A.B.E.GG", "A.B.C.F"]))
# [['A.B.C.D', 'A.A.C.D'], ['A.B.E.F', 'A.B.E.GG'], ['A.B.C.F']]
print(main(["A.B.C", "A.B.D", "A.E.D"]))
# [['A.B.C', 'A.B.D'], ['A.E.D']]
  • Related