I had an earlier problem where I was looking for a substring while iterating the string and using slicing. Turns out that's a really bad idea regarding performance. str.find
is much faster. But I don't understand why?
import random
import string
import timeit
# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))
def f():
return [i for i in range(len(haystack)) if haystack[i : i len(needle)] == needle]
def g():
return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]
def h():
def find(start=0):
while True:
position = haystack.find(needle, start)
if position < 0:
return
start = position 1
yield position
return list(find())
number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
assert eval(func "()") == expectation
t = timeit.timeit(func "()", globals=globals(), number=number)
print(func, t)
Results:
f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564
CodePudding user response:
f
and g
are slow since they check if needle
can be found in every possible location of haystack
resulting in a O(n m)
complexity. f
is slower because of the slicing operation that creates a new string object (as pointed out by Barmar in the comments).
h
is fast because it can skip many locations. For example, if the needle
string is not found, only one find
is performed. The built-in find
function is highly optimized in C and thus faster than an interpreted pure-Python code. Additionally, the find
function use an efficient algorithm called Crochemore and Perrin's Two-Way. This algorithm is much faster than searching needle
at every possible location of haystack
when the string is relatively big. The related CPython code is available here.
If the number of occurrence is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the CPTW algorithm of possibly the KMP algorithm but doing that in pure-Python will be very inefficient. You could do that in C or with Cython. That being said this is not trivial to do and not great to maintain.
CodePudding user response:
The built-in Python functions are implemented in C, which allows them to be much faster. It's not possible to make a function that performs just as well when using Python.