Efficiently check if string contains a digit in Python

I have a huge amount (GB) of text to process, sentence by sentence. For each sentence I have a costly operation to perform on numbers, so I first check that the sentence contains at least one digit. I have implemented this check in several ways and measured each one with timeit.

s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' # example

any(c.isdigit() for c in s)  # 3.61 µs
re.search(r'\d', s)  # 402 ns
d = re.compile(r'\d')
d.search(s)  # 126 ns
'0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s  # 60 ns

The last one is the fastest, but it is ugly, and I suspect it is still about 10x slower than what should be possible.

Of course I could rewrite this in cython, but it seems overkill.

Is there a better pure Python solution? In particular, I wonder why str.startswith() and str.endswith() accept a tuple argument, while the same does not seem to be possible with the in operator.
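To make the contrast concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; the any() line is only one possible spelling of the multi-character test, not a claim about the fastest one.

```python
s = 'abcdefghijklmnopqrstuvwxyz7'   # illustrative sample, ends with a digit
digits = tuple('0123456789')

# str.endswith (like str.startswith) accepts a tuple of candidates.
ends_with_digit = s.endswith(digits)

# The in operator on str requires a str on the left, so a tuple raises TypeError.
try:
    digits in s
    in_accepts_tuple = True
except TypeError:
    in_accepts_tuple = False

# A compact way to express the chained test, at the cost of a generator.
contains_digit = any(d in s for d in digits)

print(ends_with_digit, in_accepts_tuple, contains_digit)
```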

CodePudding user response：

Actual performance might vary depending on your platform and Python version, but on my setup (Python 3.9.5 / Ubuntu), compiling the regex with [0-9] instead of \d gives a small improvement. re.match turns out to be significantly faster than re.search and even outperforms the long in chain, but be aware that re.match only tries to match at the beginning of the string, so it is not an equivalent check.

import re
from timeit import timeit

n = 10_000_000
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'

# reference
timeit(lambda: '0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s, number=n)
# 2.1005349759998353

# re.search with \d, slower
d = re.compile(r'\d')
timeit(lambda: d.search(s), number=n)
# 2.9816031390000717

# re.search with [0-9], better but still slower than reference
d = re.compile('[0-9]')
timeit(lambda: d.search(s), number=n)
# 2.640713582999524

# re.match with [0-9], faster than reference, but match only tests the start of s
d = re.compile('[0-9]')
timeit(lambda: d.match(s), number=n)
# 1.5671786130005785

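So, on my machine, re.match with a compiled [0-9] pattern is about 25% faster than the long chain of in tests (1.57 s versus 2.10 s for 10 million calls). That speed comes from match giving up after the first character, though: it only reports a digit at the very start of the string. Among the checks that scan the whole string, the chained in expression is still the fastest here, with the compiled [0-9] search roughly 25% slower (2.64 s).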
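Before relying on any of these timings, it is worth checking what each call actually returns. The sketch below uses two made-up sample strings and a hypothetical has_digit helper to compare match and search on the same compiled pattern.

```python
import re

digit = re.compile(r'[0-9]')

def has_digit(sentence):
    """Return True if sentence contains an ASCII digit anywhere."""
    return digit.search(sentence) is not None

leading = '1abc'    # digit at the start
trailing = 'abc1'   # digit at the end

# match only tries position 0; search scans the whole string.
match_leading = digit.match(leading) is not None
match_trailing = digit.match(trailing) is not None
search_trailing = has_digit(trailing)

print(match_leading, match_trailing, search_trailing)
# True False True
```

With match, a sentence such as 'abc1' would be skipped even though it contains a digit, so search (or the chained in test) is the one to use for a "contains" check.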