Home > Software design >  Group / sort items from a list into sublists where every next item is a substring of the previous
Group / sort items from a list into sublists where every next item is a substring of the previous

Time:08-11

I have the following list:

  x = ['long stories about hard yarn',
     'The bears',
     'unpleasant',
     'The bears endure tiring',
     'The bears endure',
     'tiring and unpleasant',
     'long stories',
     'hard yarn',
     'tiring',
     'The bears endure tiring and unpleasant']

I'm expecting the following sublist output:

[['The bears endure tiring and unpleasant', 'tiring and unpleasant', 'tiring'],
 ['The bears endure tiring and unpleasant',
  'tiring and unpleasant',
  'unpleasant'],
 ['The bears endure tiring and unpleasant',
  'The bears endure tiring',
  'tiring'],
 ['The bears endure tiring and unpleasant',
  'The bears endure tiring',
  'The bears endure',
  'The bears'],
 ['long stories about hard yarn', 'long stories'],
 ['long stories about hard yarn', 'hard yarn']]

I wrote the code in the following way:

from itertools import combinations, permutations

perm = permutations(x, 2)

simlr = []
for i,j in list(perm):
    if len(i) > len(j):
        if check(i,j):
            simlr.append([i,j])
simlr

Getting the following output:

[['long stories about hard yarn', 'long stories'],
 ['long stories about hard yarn', 'hard yarn'],
 ['The bears endure tiring', 'The bears'],
 ['The bears endure tiring', 'The bears endure'],
 ['The bears endure tiring', 'tiring'],
 ['The bears endure', 'The bears'],
 ['tiring and unpleasant', 'unpleasant'],
 ['tiring and unpleasant', 'tiring'],
 ['The bears endure tiring and unpleasant', 'The bears'],
 ['The bears endure tiring and unpleasant', 'unpleasant'],
 ['The bears endure tiring and unpleasant', 'The bears endure tiring'],
 ['The bears endure tiring and unpleasant', 'The bears endure'],
 ['The bears endure tiring and unpleasant', 'tiring and unpleasant'],
 ['The bears endure tiring and unpleasant', 'tiring']]

The above output is valid, but I need to get refined output (grouping) where every sublist has main large string following by respective substrings sorted.

CodePudding user response:

What you are asking is to generate all branches from a directed acyclic graph.

Instead of storing your graph as a list of pairs [(parent, child), ...] as you did, I think it is easier to store your graph as a dictionary of lists, {parent: [child1, child2, ...], ...}

Then write a function that, starting from a given root, find all branches starting from that root.

Then write a function that identifies all "roots", i.e., all nodes that don't have parents.

As a bonus, we should remove "shortcuts" from the graph. For instance, 'The bears' should not be a direct child of 'The bears endure tiring and unpleasant', because it is already a child of 'The bears endure'.

from itertools import permutations
from collections import defaultdict

def build_tree(sentences):
    tree = defaultdict(list)
    for x,y in permutations(sentences, 2):
        if x in y:
            if not any(x in z and z in y and x != z != y for z in sentences):
                tree[y].append(x)
    return tree

def gen_branches(tree, root, acc=()):
    if (root not in tree or not tree[root]):
        yield acc   (root,)
    else:
        for child in tree[root]:
            yield from gen_branches(tree, child, acc (root,))

def get_roots(tree):
    candidates = set(tree)
    have_parents = set(child for children in tree.values() for child in children)
    return candidates.difference(have_parents)

sentences = ['long stories about hard yarn',
     'The bears',
     'unpleasant',
     'The bears endure tiring',
     'The bears endure',
     'tiring and unpleasant',
     'long stories',
     'hard yarn',
     'tiring',
     'The bears endure tiring and unpleasant']

tree = build_tree(sentences)
print(tree)
# defaultdict(<class 'list'>, {'The bears endure': ['The bears'], 'tiring and unpleasant': ['unpleasant', 'tiring'], 'The bears endure tiring and unpleasant': ['The bears endure tiring', 'tiring and unpleasant'], 'The bears endure tiring': ['The bears endure', 'tiring'], 'long stories about hard yarn': ['long stories', 'hard yarn']})



roots = get_roots(tree)
print(roots)
# {'The bears endure tiring and unpleasant', 'long stories about hard yarn'}

branches = [b for root in roots for b in gen_branches(tree, root)]
print(branches)
# [('The bears endure tiring and unpleasant', 'The bears endure tiring', 'The bears endure', 'The bears'),
#  ('The bears endure tiring and unpleasant', 'The bears endure tiring', 'tiring'),
#  ('The bears endure tiring and unpleasant', 'tiring and unpleasant', 'unpleasant'),
#  ('The bears endure tiring and unpleasant', 'tiring and unpleasant', 'tiring'),
#  ('long stories about hard yarn', 'long stories'),
#  ('long stories about hard yarn', 'hard yarn')]
  • Related