Home > front end >  How to optimize number regex in JavaScript?
How to optimize number regex in JavaScript?

Time:10-06

Issue

A string should be tested whether it contains numbers with optional characters between the digits. I'll use the terms number meaning a series of digits (e.g. 123) and digit meaning an element of a number (e.g. 0, 1, 2).

Example strings the regex should test and detect the number 012345:

abc 012345 def
abc 0.1.2.3.4.5 def
abc 0-1 2a3x4,,,5 def

Characteristics

  • a) The numbers are known, i.e. it should not match any number but specific numbers.
  • b) The symbols between digits are unknown and of unknown lengths; though in practice, the length won't exceed 10 in 99% of cases.
  • c) The strings to test are < 100 characters in 99% of cases.

The code run in JavaScript Node.js:

// For each regex
for (const regex of regexes) {
  if (new RegExp(regex, 'iu').test(text)) {
    // ...
  }
}

The most simple (and most inefficient) regex requires a significant compute time:

const regexes = [
  '0.*1.*2.*3.*4.*5',
  '1.*1.*2.*2.*3.*3'
];

Is there any potential optimization, either in the regex or in the JS code?


Performance

Execution time comparison from suggestions in comments and answers below, in order from fastest to slowest:

  1. remove non-number chars with str.replace(/\D/g, '') and then use the regex 012345 is the fastest; it's not a pure regex optimization, but as I said, the optimization can include the JavaScript code (thanks @anubhava)
  2. use negated character classes 0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5 is only ~3% slower than (1) and doesn't require any additional JS code, so it's the best pure regex optimization (thanks @WiktorStribiżew)
  3. with real word data, both the greedy 1.*2.*3.*4.*5 and range limited 1.{0,10}2.{0,10}3.{0,10}4.{0,10}5 variants are ~50% - 90% slower than (1)

CodePudding user response:

You may use negated character classes negating the char that is on the right:

const regexes = [
  '0[^1]*1[^2]*2[^3]*3[^4]*4[^5]*5',
  '1[^1]*1[^2]*2[^2]*2[^3]*3[^3]*3'
]

See, [^1]*1 matches any zero or more chars other than 1 and then a 1 char. Negated character classes are much more efficient than .* or .*? construts since they help avoid excessive backtracking and should be used always when you need to match shortest substrings between two chars.

  • Related