How is str.find so fast?-CodePudding

I had an earlier problem where I was looking for a substring while iterating the string and using slicing. Turns out that's a really bad idea regarding performance. str.find is much faster. But I don't understand why?

import random
import string
import timeit

# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))

def f():
    return [i for i in range(len(haystack)) if haystack[i : i   len(needle)] == needle]

def g():
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

def h():
    def find(start=0):
        while True:
            position = haystack.find(needle, start)
            if position < 0:
                return
            start = position   1
            yield position
    return list(find())

number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
    assert eval(func   "()") == expectation
    t = timeit.timeit(func   "()", globals=globals(), number=number)
    print(func, t)

Results:

f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564

CodePudding user response：

f and g are slow since they check if needle can be found in every possible location of haystack resulting in a O(n m) complexity. f is slower because of the slicing operation that creates a new string object (as pointed out by Barmar in the comments).

h is fast because it can skip many locations. For example, if the needle string is not found, only one find is performed. The built-in find function is highly optimized in C and thus faster than an interpreted pure-Python code. Additionally, the find function use an efficient algorithm called Crochemore and Perrin's Two-Way. This algorithm is much faster than searching needle at every possible location of haystack when the string is relatively big. The related CPython code is available here.

If the number of occurrence is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the CPTW algorithm of possibly the KMP algorithm but doing that in pure-Python will be very inefficient. You could do that in C or with Cython. That being said this is not trivial to do and not great to maintain.

CodePudding user response：

The built-in Python functions are implemented in C, which allows them to be much faster. It's not possible to make a function that performs just as well when using Python.