How to generate all subsequent combinations of 1, 2 and 3 words from a given string in python?-CodePudding

I have a string in python. I want to get all one word substrings, all 2 word substrings and all 3 word substrings. What is the most efficient way to do this?

My current solution is like this:

>>> s = "This is the example string of which I want to generate subsequent combinations"
>>> words = s.split()
>>> lengths = [1, 2, 3]
>>> ans = []
>>> for ln in lengths:
...     for i in range(len(words)-ln 1):
...         ans.append(" ".join(words[i:i ln]))
... 
>>> print(ans)
['This', 'is', 'the', 'example', 'string', 'of', 'which', 'I', 'want', 'to', 'generate', 'subsequent', 'combinations', 'This is', 'is the', 'the example', 'example string', 'string of', 'of which', 'which I', 'I want', 'want to', 'to generate', 'generate subsequent', 'subsequent combinations', 'This is the', 'is the example', 'the example string', 'example string of', 'string of which', 'of which I', 'which I want', 'I want to', 'want to generate', 'to generate subsequent', 'generate subsequent combinations']

CodePudding user response：

you can do it like this:

from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return list(map(lambda x: " ".join(x), chain.from_iterable(combinations(s, r) for r in range(1,4))))

s = "This is the example string of which I want to generate subsequent combinations"
print(powerset(s.split()))

for detailed understanding read: https://stackoverflow.com/a/1482316/17073342

CodePudding user response：

FWIW, you can do what you have as a list comprehension:

[' '.join(words[i:i l]) for l in [1,2,3] for i in range(len(words)-l 1)]

Is it faster? A little bit:

%%timeit
ans = []
for ln in [1,2,3]:
    for i in range(len(words)-ln 1):
        ans.append(" ".join(words[i:i ln]))
        
# 8.46 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
[' '.join(words[i:i l]) for l in [1,2,3] for i in range(len(words)-l 1)]


# 7.03 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is it more readable? Probably not. I might just stick with what you have.

CodePudding user response：

I think the easiest to understand (for me anyways) and likely the fastest will be to handle the special cases of the first two words then iterate over the remaining words while keeping track of the prior word(s).

It has the side benefit of being the fastest so far.

words = "This is the example string of which I want to generate subsequent combinations".split()
prior_prior_word = words[0]
prior_word = words[1]
ans = [prior_prior_word, prior_word, f"{prior_prior_word} {prior_word}"]
for word in words[2:]:
    ans.append(f"{word}")
    ans.append(f"{prior_word} {word}")
    ans.append(f"{prior_prior_word} {prior_word} {word}")
    prior_prior_word = prior_word
    prior_word = word
print(ans)

If you want to timeit, you can try:

import timeit

ruchit = '''
words = "This is the example string of which I want to generate subsequent combinations".split()
def test(words):
    lengths = [1, 2, 3]
    ans = []
    for ln in lengths:
        for i in range(len(words)-ln 1):
            ans.append(" ".join(words[i:i ln]))
    return ans
'''

tom = '''
words = "This is the example string of which I want to generate subsequent combinations".split()
def test(words):
    return [' '.join(words[i:i l]) for l in [1,2,3] for i in range(len(words)-l 1)]
'''
        
jonsg = '''
words = "This is the example string of which I want to generate subsequent combinations".split()
def test(words):
    prior_prior_word = words[0]
    prior_word = words[1]
    ans = [prior_prior_word, prior_word, f"{prior_prior_word} {prior_word}"]
    for word in words[2:]:
        ans.append(f"{word}")
        ans.append(f"{prior_word} {word}")
        ans.append(f"{prior_prior_word} {prior_word} {word}")
        prior_prior_word = prior_word
        prior_word = word
    return ans
'''

runs = 1_000_000
print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
print(f"Test: ruchit Time: {timeit.timeit('test(words)', setup=ruchit, number=runs)}")
print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
print(f"Test: tom Time: {timeit.timeit('test(words)', setup=tom, number=runs)}")
print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
print(f"Test: jonsg Time: {timeit.timeit('test(words)', setup=jonsg, number=runs)}")
print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

This gives me:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Test: ruchit Time: 8.692457999999998
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Test: tom Time: 7.512314900000002
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Test: jonsg Time: 3.7232652
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Your mileage may vary.