I have the following list:
x = ['long stories about hard yarn',
'The bears',
'unpleasant',
'The bears endure tiring',
'The bears endure',
'tiring and unpleasant',
'long stories',
'hard yarn',
'tiring',
'The bears endure tiring and unpleasant']
I'm expecting the following sublist output:
[['The bears endure tiring and unpleasant', 'tiring and unpleasant', 'tiring'],
['The bears endure tiring and unpleasant',
'tiring and unpleasant',
'unpleasant'],
['The bears endure tiring and unpleasant',
'The bears endure tiring',
'tiring'],
['The bears endure tiring and unpleasant',
'The bears endure tiring',
'The bears endure',
'The bears'],
['long stories about hard yarn', 'long stories'],
['long stories about hard yarn', 'hard yarn']]
I wrote the code in the following way:
from itertools import combinations, permutations
perm = permutations(x, 2)
simlr = []
for i,j in list(perm):
if len(i) > len(j):
if check(i,j):
simlr.append([i,j])
simlr
Getting the following output:
[['long stories about hard yarn', 'long stories'],
['long stories about hard yarn', 'hard yarn'],
['The bears endure tiring', 'The bears'],
['The bears endure tiring', 'The bears endure'],
['The bears endure tiring', 'tiring'],
['The bears endure', 'The bears'],
['tiring and unpleasant', 'unpleasant'],
['tiring and unpleasant', 'tiring'],
['The bears endure tiring and unpleasant', 'The bears'],
['The bears endure tiring and unpleasant', 'unpleasant'],
['The bears endure tiring and unpleasant', 'The bears endure tiring'],
['The bears endure tiring and unpleasant', 'The bears endure'],
['The bears endure tiring and unpleasant', 'tiring and unpleasant'],
['The bears endure tiring and unpleasant', 'tiring']]
The above output is valid, but I need to get refined output (grouping) where every sublist has main large string following by respective substrings sorted.
CodePudding user response:
What you are asking is to generate all branches from a directed acyclic graph.
Instead of storing your graph as a list of pairs [(parent, child), ...]
as you did, I think it is easier to store your graph as a dictionary of lists, {parent: [child1, child2, ...], ...}
Then write a function that, starting from a given root, find all branches starting from that root.
Then write a function that identifies all "roots", i.e., all nodes that don't have parents.
As a bonus, we should remove "shortcuts" from the graph. For instance, 'The bears'
should not be a direct child of 'The bears endure tiring and unpleasant'
, because it is already a child of 'The bears endure'
.
from itertools import permutations
from collections import defaultdict
def build_tree(sentences):
tree = defaultdict(list)
for x,y in permutations(sentences, 2):
if x in y:
if not any(x in z and z in y and x != z != y for z in sentences):
tree[y].append(x)
return tree
def gen_branches(tree, root, acc=()):
if (root not in tree or not tree[root]):
yield acc (root,)
else:
for child in tree[root]:
yield from gen_branches(tree, child, acc (root,))
def get_roots(tree):
candidates = set(tree)
have_parents = set(child for children in tree.values() for child in children)
return candidates.difference(have_parents)
sentences = ['long stories about hard yarn',
'The bears',
'unpleasant',
'The bears endure tiring',
'The bears endure',
'tiring and unpleasant',
'long stories',
'hard yarn',
'tiring',
'The bears endure tiring and unpleasant']
tree = build_tree(sentences)
print(tree)
# defaultdict(<class 'list'>, {'The bears endure': ['The bears'], 'tiring and unpleasant': ['unpleasant', 'tiring'], 'The bears endure tiring and unpleasant': ['The bears endure tiring', 'tiring and unpleasant'], 'The bears endure tiring': ['The bears endure', 'tiring'], 'long stories about hard yarn': ['long stories', 'hard yarn']})
roots = get_roots(tree)
print(roots)
# {'The bears endure tiring and unpleasant', 'long stories about hard yarn'}
branches = [b for root in roots for b in gen_branches(tree, root)]
print(branches)
# [('The bears endure tiring and unpleasant', 'The bears endure tiring', 'The bears endure', 'The bears'),
# ('The bears endure tiring and unpleasant', 'The bears endure tiring', 'tiring'),
# ('The bears endure tiring and unpleasant', 'tiring and unpleasant', 'unpleasant'),
# ('The bears endure tiring and unpleasant', 'tiring and unpleasant', 'tiring'),
# ('long stories about hard yarn', 'long stories'),
# ('long stories about hard yarn', 'hard yarn')]