I am trying to parse strings with a regex, capturing a few groups and optionally discarding the rest of the string.
My strings have the form:
[GROUP 1][delimiter 1][GROUP 2][delimiter 2][GROUP 3][delimiter 3 - optional][REST OF THE STRING - optional]
For example:
07. Neospace - Into The Night (Chris Van Buren)
13. Atomic Space Orchestra - Starfleet
I am trying to capture GROUP 1, GROUP 2 and GROUP 3 while ignoring REST OF THE STRING.
The following regex works well if [delimiter 3] is present:
(\d+)\. (.*) - (.*)(?: \()
I am getting "07", "Neospace" and "Into The Night".
But for the second string, there is no match, because my last non-capturing group is mandatory.
When I try to make the last group optional like this:
(\d+)\. (.*) - (.*)(?: \()?
the non-capturing group stops working and I am getting "Into The Night (Chris Van Buren)" for GROUP 3, which is NOT what I want.
CodePudding user response:
How about making the third capturing group reluctant?
(\d+)\. (.*) - (.*?)(?> \(|$)
This will go only as far as it needs to, until it hits either a left parenthesis or the end of input.
Maybe also trim the whitespace at the end of the third group:
(\d+)\. (.*) - (.*?)(?> *\(|$)
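To make the difference concrete, here is a minimal Python sketch comparing the greedy attempt from the question with the reluctant variant. It assumes the quantifier after \d is + and uses a plain non-capturing group (?:...) in place of the atomic group (?>...), since atomic groups are only available in Python 3.11 and later. The helper name third_group is made up for illustration.

```python
import re

# Greedy third group with an optional delimiter (the attempt from the question).
GREEDY = re.compile(r"(\d+)\. (.*) - (.*)(?: \()?")

# Reluctant third group that must be followed by optional spaces and "(",
# or by the end of input. (?:...) replaces (?>...) here as an assumption,
# so the sketch also runs on Python versions before 3.11.
LAZY = re.compile(r"(\d+)\. (.*) - (.*?)(?: *\(|$)")


def third_group(pattern, line):
    """Return capture group 3 for `line`, or None if there is no match."""
    m = pattern.match(line)
    return m.group(3) if m else None


with_paren = "07. Neospace - Into The Night (Chris Van Buren)"
without_paren = "13. Atomic Space Orchestra - Starfleet"

for line in (with_paren, without_paren):
    print(repr(third_group(GREEDY, line)), repr(third_group(LAZY, line)))
```

The greedy group swallows the parenthesised part because the optional group is allowed to match nothing, while the reluctant group stops at the first position where the delimiter or the end of input follows.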
Note that both these variants might explode quite badly (e.g. try out something like "a b c d" with lots of spaces between the characters in the 3rd group with the second regex). They are suitable only for scripting, so please don't use them in user-facing production code unless you want to end up with a ReDoS vulnerability.
A less dangerous way to accomplish essentially the same would be something like
(\d+)\.([^-]+)-([^(\n]+)
and then manually trim the whitespace. This should not backtrack anywhere.
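A rough sketch of that approach in Python, assuming the quantifiers are + and that trimming is done with str.strip(). The function name parse_track and the variable names artist and title are hypothetical labels for GROUP 2 and GROUP 3.

```python
import re

# Negated character classes: group 2 cannot contain "-",
# group 3 cannot contain "(" or a newline.
NEGATED = re.compile(r"(\d+)\.([^-]+)-([^(\n]+)")


def parse_track(line):
    """Return (number, artist, title) with surrounding whitespace trimmed,
    or None if the line does not match."""
    m = NEGATED.match(line)
    if m is None:
        return None
    number, artist, title = m.groups()
    # The pattern keeps the spaces around the delimiters, so trim manually.
    return number, artist.strip(), title.strip()


print(parse_track("07. Neospace - Into The Night (Chris Van Buren)"))
print(parse_track("13. Atomic Space Orchestra - Starfleet"))
```

One caveat with this variant: because group 2 excludes every hyphen, a hyphen inside the second field splits it early. For example, "01. A-B - C" would yield "A" for group 2 and "B - C" for group 3.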
CodePudding user response:
If the 3rd group has ( as a delimiter, you can use a negated character class to exclude matching a ( char.
Note that using * as a quantifier can also match an empty string between the delimiters.
If the match should be at the start of the string, you can prepend the pattern with ^
(\d+)\. (.*?) - ([^(\n]*)
Explanation
(\d+)  Capture group 1, match 1 or more digits
\.  Match a literal dot (followed in the pattern by a literal space)
(.*?)  Capture group 2, match 0 or more times any character, as few as possible
 -  Match " - " literally
([^(\n]*)  Capture group 3, match 0 or more times any character except ( or a newline
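As an illustration, this pattern could be applied to several lines at once in Python. The sketch below assumes the pattern is anchored with ^ and compiled with re.MULTILINE; the function name parse_playlist is hypothetical. Group 3 still contains the space before the (, so it is trimmed afterwards.

```python
import re

# ^ anchors each match at the start of a line (re.MULTILINE).
TRACK = re.compile(r"^(\d+)\. (.*?) - ([^(\n]*)", re.MULTILINE)


def parse_playlist(text):
    """Return a list of (group 1, group 2, group 3) tuples, one per matching line."""
    return [
        # Group 3 may end with the space that precedes "(", so strip it.
        (m.group(1), m.group(2), m.group(3).rstrip())
        for m in TRACK.finditer(text)
    ]


playlist = (
    "07. Neospace - Into The Night (Chris Van Buren)\n"
    "13. Atomic Space Orchestra - Starfleet\n"
)

for track in parse_playlist(playlist):
    print(track)
```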