input list:
['A.B.C.D','A.A.C.D','A.B.E.F','A.B.E.GG']
The strings are variables concatenated with a dot. So in first string, variables are A B C D, but in last string variable are A B E GG (variables can be of varying length), but the strings will always have the same number of variables separated by the dot. I would like to group those strings together which have only one different variable. So above would produce 2 groups.
['A.B.C.D','A.A.C.D'] ['A.B.E.F','A.B.E.GG']
The the difference has to be in the same variable. not one difference across all variables. and each string should be included only once, not multiple times across groups.
I wrote an algorithm which does it but is too complicated. I also tried via itertools groupby but could not get it to produce the correct results:
import itertools
import difflib
class Grouper:
def __init__(self, diff):
self.last = None
self.diff = diff
def get_diffs(self, curr_key):
if self.last == None:
return []
dims = curr_key.split('.')
previous_dims = self.last.split('.')
diffs = list((dm for dm in difflib.ndiff(dims, previous_dims) if '-' not in dm and '?' not in dm))
return [n for n, d in enumerate(diffs) if ' ' in d]
def __call__(self, item):
result = self.get_diffs(item)
print(item)
self.last = item
return result
grp = Grouper(data)
groups = []
uniquekeys = []
for k, g in itertools.groupby(data, grp):
groups.append(list(g)) # Store group iterator as a list
uniquekeys.append(k)
CodePudding user response:
I added another string "A.B.C.F", this string can be matched with "A.B.C.D" and "A.B.E.F":
import itertools
def main(l: list) -> list:
splits = [tuple(s.split(".")) for s in l]
groups = {}
# Dict of {(tuple, index of difference): list of tuples that match the key}
for split in splits:
matched = False
for (t, i) in groups.keys():
# We match two splits if they only have one difference
diffs = [i for i in range(len(split)) if split[i] != t[i]]
if len(diffs) == 1:
# The strings match but is the first match of 't'?
if i is None:
groups.pop((t, i))
groups[(t, diffs[0])] = [split]
# We found a match, stop the matching of this tuple
matched = True
break
elif diffs[0] == i:
groups[(t, i)].append(split)
matched = True
break
if not matched:
# Let's add this split as a candidate for future splits
groups[(split, None)] = []
return [[".".join(k)] [".".join(s) for s in v] for (k, i), v in groups.items()]
print(main(["A.B.C.D", "A.A.C.D", "A.B.E.F", "A.B.E.GG", "A.B.C.F"]))
# [['A.B.C.D', 'A.A.C.D'], ['A.B.E.F', 'A.B.E.GG'], ['A.B.C.F']]
print(main(["A.B.C", "A.B.D", "A.E.D"]))
# [['A.B.C', 'A.B.D'], ['A.E.D']]