I have a huge amount (GB) of text to process, sentence by sentence.
In each sentence I have a costly operation to perform on numbers, so I check that this sentence contains at least one digit.
I have done this check using different means and measured those solutions using timeit.
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' # example
any(c.isdigit() for c in s)
# 3.61 µs
re.search(r'\d', s)
# 402 ns
d = re.compile(r'\d')
d.search(s)
# 126 ns
'0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s
# 60 ns
The last way is the fastest one, but it is ugly and probably 10x slower than possible.
Of course I could rewrite this in cython, but it seems overkill.
Is there a better pure Python solution? In particular, I wonder why you can use str.startswith() and str.endswith() with a tuple argument, but that does not seem to be possible with the in operator.
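To make the asymmetry concrete, here is a minimal sketch of the behaviour described above. The variable names (digits, tuple_in_works, and so on) are only illustrative.

```python
s = 'abcdefghijklmnopqrstuvwxyz'
digits = tuple('0123456789')

# str.startswith() and str.endswith() accept a tuple of candidates
starts_with_digit = s.startswith(digits)
ends_with_digit = '2 apples and 3'.endswith(digits)

# the in operator on a str only accepts a str as its left operand,
# so a tuple raises TypeError instead of testing each element
try:
    digits in s
    tuple_in_works = True
except TypeError:
    tuple_in_works = False

print(starts_with_digit, ends_with_digit, tuple_in_works)
```

So the tuple form has to be emulated by hand, which is what the long chain of in tests does.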
CodePudding user response:
Actual performance might vary depending on your platform and Python version, but on my setup (Python 3.9.5 / Ubuntu), it turns out that re.match is significantly faster than re.search, and outperforms the long in series version. Also, compiling the regex with [0-9] instead of \d provides a little improvement.
import re
from timeit import timeit
n = 10_000_000
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
# reference
timeit(lambda: '0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s, number=n)
# 2.1005349759998353
# re.search with \d, slower
d = re.compile(r'\d')
timeit(lambda: d.search(s), number=n)
# 2.9816031390000717
# re.search with [0-9], better but still slower than reference
d = re.compile('[0-9]')
timeit(lambda: d.search(s), number=n)
# 2.640713582999524
# re.match with [0-9], faster than reference
d = re.compile('[0-9]')
timeit(lambda: d.match(s), number=n)
# 1.5671786130005785
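So, on my machine, using re.match with a compiled [0-9] pattern is about 25% faster than the long or ... in chaining. And it looks better too. Note, however, that re.match only tries to match at the beginning of the string, so d.match(s) returns None for any string whose first character is not a digit, even if a digit appears later. The match timing above therefore measures an early exit rather than a full scan, and match is not a drop-in replacement for the search or in versions.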
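Before comparing timings, it is worth checking that the variants return the same answers. The following is a minimal sketch of such a check; the helper names has_digit_search, has_digit_match and has_digit_in are made up for illustration and are not part of the re module.

```python
import re

DIGIT = re.compile(r'[0-9]')

def has_digit_search(s):
    # search() scans the whole string for the first match
    return DIGIT.search(s) is not None

def has_digit_match(s):
    # match() only tries to match at position 0
    return DIGIT.match(s) is not None

def has_digit_in(s):
    # same test as the long "in" chain, written with any()
    return any(d in s for d in '0123456789')

samples = ['abc', '7 apples', 'apples: 7', '']
results = {t: (has_digit_search(t), has_digit_match(t), has_digit_in(t))
           for t in samples}

for text, flags in results.items():
    print(repr(text), flags)
```

On 'apples: 7' the search and in versions report a digit while the match version does not, so only the first two are suitable for filtering sentences.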