Home > Blockchain >  Determine the shortest possible match of a regular expression
Determine the shortest possible match of a regular expression

Time:01-23

I have an randomly ordered array of regular expressions like this:

let patterns = [
    /foo ba r/,
    /foo/,
    /foo bar/,
    /foobar/,
    /m[eo]{4,}w/,
    /boo/,
    /fooo*/,
    /meow/
]

I'm not sure if this is possible but I would like to write an algorithm which sorts the regular expressions from least greedy to most greedy, like this:

[
    /foo/,
    /boo/,
    /fooo*/,
    /meow/,
    /foobar/,
    /foo bar/,
    /m[eo]{4,}w/,
    /foo ba r/
]

I would imagine such sorting could be achieved like so:

patterns.sort((p1, p2) { return p1.greediness() - p2.greediness() });

But there exists no method called greediness in the the RegExpr class.

Ideally, the greediness method would return the number of characters which could be possibly matched at minimum. i.e:

/foo/.greediness() == 3
/boo/.greediness() == 3
/fooo*/.greediness() == 3
/meow/.greediness() == 4
/foobar/.greediness() == 6
/foo bar/.greediness() == 6
/m[eo]{4,}w/.greediness() == 6
/foo ba r/.greediness() == 6

What would your solution be to this problem?

CodePudding user response:

Here is a concept of what you might use as an idea.

You can use prototype to define a custom function for the RegEx:

// Extend Regular Expression with a custom method
RegExp.prototype.greediness = function () {
  return (this.source.match(/[* {]/g) || []).length
}

And then you perform a sorting on that patterns:

// Sort patterns by greediness
const patternByGreediness = patterns.sort((a, b) => {
  return b.greediness() - a.greediness()
})

Check full code snippet

let patterns = [
  /foo ba r/,
  /foo/,
  /foo bar/,
  /foobar/,
  /m[eo]{4,}w/,
  /boo/,
  /fooo*/,
  /meow/
]

// Sort
patterns.sort((a, b) => {
  return b.source.length - a.source.length
}).reverse()

// Reverse order as you want
console.log(patterns)

// Extend Regular Expression with a custom method
RegExp.prototype.greediness = function () {
  return (this.source.match(/[* {]/g) || []).length
}

// Sort patterns by greediness
const patternByGreediness = patterns.sort((a, b) => {
  return b.greediness() - a.greediness()
})
console.log(patternByGreediness)

// Get a number of how greedy a pattern is
console.log(/m[eo]{4,}w/.greediness())

Hope that helps.

CodePudding user response:

As Pointy said in the comments, this is a hard problem.

Here is the beginning of a solution:

const greediness = (s) =>
  s .toString () .slice (1, -1)
    .replace (/\[[^\]] ]/g, 'X')
    .replace (/.\{((\d )(,\d*))\}/g, (s, a, ds, _) => 'X' .repeat (Number (ds)))
    .replace (/.\ /g, 'X')
    .replace (/.\?/g, '')
    .replace (/.\*/g, '')
    .length

const sortByGreediness = (patterns) =>
  [...patterns] .sort ((a, b) => greediness (a) - greediness (b))
    // .map (s => [s, greediness (s)])  // to show sizes


const patterns = [/foo ba r/, /foo/, /foo bar/, /foobar/, /m[eo]{4,}w/, /boo/, /fooo*/, /meow/]


console .log (sortByGreediness (patterns))

We simply take the text of the regex and replace quantifiers and their preceding characters with the smallest number of characters that might match. We do something similar for blocks like [eo] and X{4,}.

This we might go through steps like this:

m[eo]{4,}wp u r?r*
mX{4,}wp u r?r*
mXXXXwp u r?r*
mXXXXXwXr?r*
mXXXXXwXr*
mXXXXXwX
  - length 7

But this doesn't touch on the complexities that can be inside a regex, and doesn't even try to handle capturing groups. I think it would next to impossible to do for the full regex spec, but perhaps this can be expanded toward what you need.

(If you're getting more complex, you might want to do this repeatedly with code something like the following, or with a while loop in place of its recursion.

const greediness = (s, next = s .toString () .slice (1, -1), prev = null) =>
  next === prev
    ? next .length
    : greediness (s, next
        .replace (/\[[^\]] ]/g, 'X')
        .replace (/.\{((\d )(,\d*))\}/g, (s, a, ds, _) => 'X' .repeat (Number (ds)))
        .replace (/.\ /g, 'X')
        .replace (/.\?/g, '')
        .replace (/.\*/g, '')
      , next)
  • Related