Merging overlapping (str) objects-CodePudding

The problem is the following :

I want to go from having this set

{'A/B', 'B/C', 'C/D', 'D/E', ..., 'U/V', 'V/W', ..., 'X/Y', ..., 'Z', ...}

to this set

{'A/B/C/D/E', ..., 'U/V/W', ..., 'X/Y', ..., 'Z', ...}

where the objects A, B, C ... are just strings of characters. The output solution should be independent of the order in which the objects appears (i.e. if you scramble the objects in the set, the solution should always be the same)

In other words I want to merge overlapping objects.

Inputs of the following form cannot happen :

{"A/B", "B/C", "B/D"}
{"A/B", "B/C", "C/A"}

There can be objects with no '/' in them.

Here is a partial solution I've come up with :

    example={'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'}
    
    def ext_3plus(unit):
        for couple in list(itertools.combinations(list(unit),2)):
            if '/' in couple[0] and '/' in couple[1]:
                if couple[0].split('/')[0]==couple[1].split('/')[1]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[1].split('/')[0] '/' couple[0])
                if couple[0].split('/')[1]==couple[1].split('/')[0]:
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[0] '/' couple[1].split('/')[1])
            else: #the input can contain object not having '/'
                continue

There is two problems, first it does only one iteration, the result on {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y'}

is :

{'A/B/C', 'C/D/E', 'U/V/W', 'X/Y'}

Second, if I include objects containing no '/', the input being {'A/B', 'B/C', 'C/D', 'D/E','U/V', 'V/W','X/Y','Z'}, the result is different from the previous one :

{'A/B', 'B/C/D', 'D/E', 'U/V/W', 'X/Y', 'Z'}

So there should be a recursive call on the first iteration etc. How should it be done ?

CodePudding user response：

It might not be the most efficient, but you could just repeat the loop until there's nothing modified.

def ext_3plus(unit):
    while True:
        oldlen = len(unit)
        for couple in itertools.combinations(list(unit),2):
            if '/' in couple[0] and '/' in couple[1]:
                if couple[0].split('/')[0]==couple[1].split('/')
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[1].split('/')[0] '/' couple[0])
                    modified = True
                if couple[0].split('/')[1]==couple[1].split('/')[0]
                    unit.remove(couple[0])
                    unit.remove(couple[1])
                    unit.add(couple[0] '/' couple[1].split('/')[1])
        if len(unit) == oldlen:
            # Nothing was merged, so we're done
            break

CodePudding user response：

If I understood correctly this can be seen as a graph problem, and solve as such:

import networkx as nx

example = {'A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y', "Z"}

# convert each string to a and edge
# each pattern to the side of / is a node
edges = [tuple(s.split("/")) for s in example if "/" in s]

nodes = [s for s in example if "/" not in s]

# create directed graph from edges
g = nx.from_edgelist(edges, create_using=nx.DiGraph)
g.add_nodes_from(nodes)

# find each path using topological sort
runs, current = [], []
for e in nx.topological_sort(g):
    # start a new path each time a node with in-degree 0
    # in-degree 0 means it is the start of a new path
    if g.in_degree(e) == 0:
        if current:
            runs.append(current)
            current = []
    current.append(e)

if current:
    runs.append(current)

# format the result
result = ["/".join(run) for run in runs]
print(result)

Output

['Z', 'U/V/W', 'X/Y', 'A/B/C/D/E']

If I'm not mistaken the overall complexity of this approach is O(n). More on topological sorting can be found here.

CodePudding user response：

You can use a recursive generator function:

vals = ['A/B', 'B/C', 'C/D', 'D/E', 'U/V', 'V/W', 'X/Y']
data = [i.split('/') for i in vals]
def paths(d, c = [], s = []):
   if not (k:=[b for a, b in data if a == d]):
      yield c [d]
      if (t:=[a for a, b in data if a not in s [d]]):
         yield from paths(t[0], c = [], s=s [d])
   else:
       yield from [j for i in k for j in paths(i, c=c [d], s=s [d])]

vals = list(paths(data[0][0]))

Output:

[['A', 'B', 'C', 'D', 'E'], ['U', 'V', 'W'], ['X', 'Y']]