Problem with optional non-capturing group in Regex

Time:07-02

I am trying to do regex parsing and matching and optionally discard the rest of the string.

My strings are of type:

[GROUP 1][delimiter 1][GROUP 2][delimiter 2][GROUP 3][delimiter 3 - optional][REST OF THE STRING - optional]

For example:

07. Neospace - Into The Night (Chris Van Buren)
13. Atomic Space Orchestra - Starfleet

I am trying to capture GROUP 1, GROUP 2 and GROUP 3 while ignoring REST OF THE STRING

The following regex works well if [delimiter 3] is present:

(\d+)\. (.*) - (.*)(?: \()

I get "07", "Neospace" and "Into The Night". But for the second string there is no match, because my last non-capturing group is mandatory. When I try to make the last group optional like this:

(\d+)\. (.*) - (.*)(?: \()?

the non-capturing group stops working and I get "Into The Night (Chris Van Buren)" for GROUP 3, which is NOT what I want.
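The behaviour described in the question can be reproduced with a short sketch. Python's re module is an assumption here (the question does not name a language), and the helper name groups is made up for illustration.

```python
import re

titles = [
    "07. Neospace - Into The Night (Chris Van Buren)",
    "13. Atomic Space Orchestra - Starfleet",
]

# Trailing " (" is required: group 3 backtracks until " (" follows it.
mandatory = re.compile(r"(\d+)\. (.*) - (.*)(?: \()")
# Trailing " (" is optional: the greedy group 3 runs to the end of the line,
# and the optional group then matches nothing.
optional = re.compile(r"(\d+)\. (.*) - (.*)(?: \()?")


def groups(pattern, text):
    """Return the captured groups, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.groups() if m else None


for title in titles:
    print(groups(mandatory, title), groups(optional, title))
```

With the optional variant, nothing forces the greedy third group to give characters back, so it keeps the parenthesised part.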

CodePudding user response:

How about making the third capturing group reluctant?

(\d+)\. (.*) - (.*?)(?> \(|$)

This will go only as far as it needs to, until it hits either a left parenthesis or the end of input.

Maybe also trim the whitespace at the end of third group:

(\d+)\. (.*) - (.*?)(?> *\(|$)
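As a rough illustration of the reluctant third group, here is a Python sketch. Python is an assumption, and the names TRACK and parse_track are hypothetical. The sketch swaps the atomic group (?>...) for a plain non-capturing group (?:...), because Python only accepts the atomic form from version 3.11 on. Nothing follows the group in the pattern, so the captured results should be the same.

```python
import re

# (?: *\(|$) is an ordinary non-capturing group, used here instead of the
# atomic (?>...) form so the sketch also runs on Python versions before 3.11.
TRACK = re.compile(r"(\d+)\. (.*) - (.*?)(?: *\(|$)")


def parse_track(line):
    """Return (number, artist, title), or None when the line does not match."""
    m = TRACK.search(line)
    return m.groups() if m else None


print(parse_track("07. Neospace - Into The Night (Chris Van Buren)"))
print(parse_track("13. Atomic Space Orchestra - Starfleet"))
```

The lazy (.*?) grows one character at a time and stops as soon as the rest of the pattern can match, which is either optional spaces followed by "(" or the end of the input.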

Note that both of these variants can backtrack quite badly on some inputs (e.g. try the second regex on a third group like a b c d with lots of spaces between the characters). They are suitable only for scripting, so please don't use them in user-facing production code unless you want to end up with a ReDoS vulnerability.

A less dangerous way to accomplish essentially the same would be something like

(\d+)\.([^-]+)-([^(\n]+)

and then manually trim the whitespace. This should not backtrack anywhere.
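A minimal sketch of this approach, again assuming Python; the names SAFE_TRACK and parse_track_trimmed are hypothetical. The negated character classes capture the surrounding spaces, so each group is stripped afterwards.

```python
import re

# Each group is a negated character class, so the spaces around the
# delimiters end up inside the groups and are trimmed afterwards.
SAFE_TRACK = re.compile(r"(\d+)\.([^-]+)-([^(\n]+)")


def parse_track_trimmed(line):
    """Return (number, artist, title) with whitespace trimmed, or None."""
    m = SAFE_TRACK.match(line)
    if m is None:
        return None
    return tuple(part.strip() for part in m.groups())


print(parse_track_trimmed("07. Neospace - Into The Night (Chris Van Buren)"))
print(parse_track_trimmed("13. Atomic Space Orchestra - Starfleet"))
```

One trade-off of [^-]+ is that the second group ends at the first hyphen, so an artist name that itself contains a hyphen would be split at that hyphen rather than at the " - " delimiter.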

CodePudding user response:

If the 3rd group has ( as a delimiter, you can use a negated character class to exclude matching a ( char.

Note that using * as a quantifier can also match an empty string between the delimiters.

If the match should be at the start of the string, you can prepend the pattern with ^

(\d+)\. (.*?) - ([^(\n]*)

Explanation

  • (\d+) Capture group 1, match 1+ digits
  • \. Match . followed by a space
  • (.*?) Capture group 2, match 0+ times any character, as few as possible
  • " - " Match literally (space, hyphen, space)
  • ([^(\n]*) Capture group 3, match 0+ times any character except ( or a newline
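The pattern explained above, with the ^ anchor prepended, could be applied to a multi-line track list roughly as follows. Python, the re.MULTILINE flag, and the names LINE and parse_playlist are assumptions made for this sketch. Group 3 stops right before "(", so it can carry a trailing space, which is trimmed here.

```python
import re

# ^ with re.MULTILINE anchors each match at the start of a line.
LINE = re.compile(r"^(\d+)\. (.*?) - ([^(\n]*)", re.MULTILINE)


def parse_playlist(text):
    """Return a list of (number, artist, title) tuples, one per matching line."""
    return [
        (m.group(1), m.group(2), m.group(3).rstrip())
        for m in LINE.finditer(text)
    ]


playlist = (
    "07. Neospace - Into The Night (Chris Van Buren)\n"
    "13. Atomic Space Orchestra - Starfleet\n"
)
for track in parse_playlist(playlist):
    print(track)
```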

