Issue
A string should be tested whether it contains numbers with optional characters between the digits. I'll use the terms number meaning a series of digits (e.g. 123
) and digit meaning an element of a number (e.g. 0
, 1
, 2
).
Example strings the regex should test and detect the number 012345
:
abc 012345 def
abc 0.1.2.3.4.5 def
abc 0-1 2a3x4,,,5 def
Characteristics
- a) The numbers are known, i.e. it should not match any number but specific numbers.
- b) The symbols between digits are unknown and of unknown lengths; though in practice, the length won't exceed 10 in 99% of cases.
- c) The strings to test are < 100 characters in 99% of cases.
The code run in JavaScript Node.js:
// For each regex
for (const regex of regexes) {
if (new RegExp(regex, 'iu').test(text)) {
// ...
}
}
The most simple (and most inefficient) regex requires a significant compute time:
const regexes = [
'0.*1.*2.*3.*4.*5',
'1.*1.*2.*2.*3.*3'
];
Is there any potential optimization, either in the regex or in the JS code?
Performance
Execution time comparison from suggestions in comments and answers below, in order from fastest to slowest:
- remove non-number chars with
str.replace(/\D/g, '')
and then use the regex012345
is the fastest; it's not a pure regex optimization, but as I said, the optimization can include the JavaScript code (thanks @anubhava) - use negated character classes
0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5
is only ~3% slower than (1) and doesn't require any additional JS code, so it's the best pure regex optimization (thanks @WiktorStribiżew) - with real word data, both the greedy
1.*2.*3.*4.*5
and range limited1.{0,10}2.{0,10}3.{0,10}4.{0,10}5
variants are ~50% - 90% slower than (1)
CodePudding user response:
You may use negated character classes negating the char that is on the right:
const regexes = [
'0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5',
'1[^1]*1[^2]*2[^2]*2[^3]*3[^3]*3'
]
See, [^1]*1
matches any zero or more chars other than 1
and then a 1
char. Negated character classes are much more efficient than .*
or .*?
construts since they help avoid excessive backtracking and should be used always when you need to match shortest substrings between two chars.