Home > Software engineering >  Need help to shorten a regex as much as possible
Need help to shorten a regex as much as possible

Time:08-18

I have a bit of a weirdly specific problem.

Given a string that looks like


.... // some unimportant info 
--------------------

Sockets: r-r-g-b b-b

--------------------
... // some unimportant info

I'm trying to determine whether or not my string possesses at least arbitrary number of "r"s, "g"s, and "b"s that are in a connected group in the "Sockets: " line. "The Sockets:" is followed by 1 to 6 r/g/b, that can either be connected together (by a '-') or not (in that case there'll be a '' between two consecutive letters.)

Example : the string above has 2r, 2r1g1b, 2b, but doesn't have 3r nor does it have 3b (because they aren't connected together).

I currently have a solution that's working pretty well : ts:. (?=(\S*r){X})(?=(\S*g){Y})(\S*b){Z}, where X, Y, Z are the numbers of r, g, b that I'm matching. I do, however, need to reduce it as much as possible (and everything has to be a single regex) because the searchbar that I'm using it for has limited space.

I had an idea about using ((\w-)*\w) to match every single r-r-g... chain, then applying some regex onto the result to verify that the chain has the right number of r, g, b, but I'm unsure how to exactly do that.

Hence :

  1. is there a short way to test whether a single-line string contains at least N of a specific character?

  2. Is there a short way to apply this test repeatedly on what I've extracted from the previous capture group?

  3. If 1) and 2) aren't feasible, do you see a way for me to reduce the size of my first solution?

CodePudding user response:

If you're anyways generating these with a script, you could test each letter in each group individually. You do not need to know beforehand whether the line actually has 1 or 6 groups, but you'd just assume that it can have 6 every time.

So you'd first test for X, Y, and Z in group that has 5 other groups on its right (this only exists if there were 6 groups in total):

ts: (\S*r){X}(?=(.* ){5})
ts: (\S*g){Y}(?=(.* ){5})
ts: (\S*b){Z}(?=(.* ){5})

Then you'd test X, Y, and Z in group that has 4 other groups on its right (could be the second group if there happened to be 6 groups in total, or the first group if there were 5 groups in total, does not exist at all if we have 4 or fewer groups in total):

ts: (\S*r){X}(?=(.* ){4})
ts: (\S*g){Y}(?=(.* ){4})
ts: (\S*b){Z}(?=(.* ){4})

and then just repeat this:

ts: (\S*r){X}(?=(.* ){3})
ts: (\S*g){Y}(?=(.* ){3})
ts: (\S*b){Z}(?=(.* ){3})
ts: (\S*r){X}(?=(.* ){2})
ts: (\S*g){Y}(?=(.* ){2})
ts: (\S*b){Z}(?=(.* ){2})
ts: (\S*r){X}(?=(.* ){1})
ts: (\S*g){Y}(?=(.* ){2})
ts: (\S*b){Z}(?=(.* ){1})

until you get to the final, rightmost group, which always exists:

ts: (\S*r){X}(?=(.* ){0})
ts: (\S*g){Y}(?=(.* ){0})
ts: (\S*b){Z}(?=(.* ){0})

CodePudding user response:

Regular expressions are weak when it comes to counting. They can count a single independent sequence all right (including some simple arithmetic properties, such as even or odd lengths), but if the lengths of two different parts are interdependent, regular expressions become anything from unwieldy to impossible (which happens when maximal length is unbounded). The classic example is strings of the form aⁿbⁿ (where means repeated n times) for all integers n, which can't be handled by regular expressions. This is due to a fundamental limitation of regular languages, and is behind such laws as the pumping lemma. They can handle bounded lengths (any finite language is regular; for example aⁿbⁿ where 0 ≤ n < 10¹⁰⁰), but in the general case this involves alternation, which is why it becomes unwieldy.

It should be noted that regexes are distinct from regular expressions; they are programmatic implementations of regular expressions and can be broader in the languages they specify, but they still generally suffer from the same limitations. The one feature that actually expands the language category is recursion, though not every regex library supports them, nor do they help in every case. They can, for example, solve aⁿbⁿ (construction left as an exercise), but not aⁿbⁿcⁿ (proof left as an exercise).

Since the language in question is bounded, it technically can be matched exactly to a regex by the unwieldy method of listing alternatives (as previously mentioned). This would be messier than a non-regex solution, so isn't much worth considering.

If the regex doesn't need to validate every defining property of the language, then it can be more permissive, matching more than the language. By ignoring some properties, you can create a simple regular language that includes the (possibly non-regular) target language.

In particular, if the regex doesn't need to enforce the 'r-g-b' order, a simple regex can check just the letters and the length: Sockets: ([rgb](?:[- ][rgb]){0,5})$. The prefix and suffix expressions can be changed if the pattern is to match in other contexts.

  • Related