How to create substrings efficiently-CodePudding

Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, 6. How can I achieve this efficiently using only Python's standard library? Here is my approach, I am looking for one which is faster. To me it seems the three outer loops are inevitable either way, but maybe there is a low-level optimized solution with itertools or so.

import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")

start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()

print(f"{end-start} seconds for naive approach")

Output of naive():

['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']

Second version:

def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size 1):
                    grams.append(word[i:i size])
    return grams

CodePudding user response：

Well, I think this is not possible to improve the algorithm, but you can micro-optimize the function:

def naive3(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i size] for word in test_sentence
                           if len(word) >= start
                           for size in rng
                           for i in range(len(word) 1-size)]

Python 3.8 introduces assignment Expressions that are quite useful for performance. Thus if you can use a recent version, then you can write:

def naive4(test_sentence,start,end):
    rng = range(start,end)
    return [word[i:i size] for word in test_sentence 
                           if (lenWord := len(word) 1) > start
                           for size in rng
                           for i in range(lenWord-size)]

Here are performance results:

naive2: 8.28 µs ±  55 ns per call
naive3: 7.28 µs ± 124 ns per call
naive4: 6.86 µs ±  48 ns per call    (20% faster than naive2)

Note that half of the time of naive4 is spent in creating the word[i:i size] string objects and the rest is mainly spent in the CPython interpreter (mainly due to the creation/reference-counting/deletion of variable-sized integer objects).

CodePudding user response：

I believe this will do it:

test_sentence = "Hi this is a wonderful test sentence".split()

lengths = [3, 4, 5, 6]

result = []
for t in test_sentence:
    for l in lengths:
        if len(t) >= l:
            start = 0
            while start   l <= len(t):
                result.append(t[start:start l])
                start  = 1