Regex doesn't stop after sign

Hi, I have a regex like this:

(.*(?=\sI+)*) (.*)

But it doesn't capture the groups the way I need.

For this example data:

Vladimir Goth
Langraab II Landgraab
Léa Magdalena III Rouault Something
Anna Maria Teodora
Léa Maria Teodora II

Only examples 1 and 2 are captured correctly.

So what I need is:

If there is no I+, the line is split at the first space.
If there are other words after I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
If there aren't any other words after I+, as in example 5, group 1 should capture only up to the first space.

@Edit: I+ should be replaced by Roman numerals.

CodePudding user response：

If you want to support any Roman numbers you can use

^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

If you need to support Roman numbers up to XX (exclusive):

^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

See the regex demo #1 and demo #2. In the Java code, replace the literal spaces with \h or \s, and double the backslashes in the Java string literal.

Details:

^ - start of string
( - Group 1 start:
- \S+ - one or more non-whitespaces
- (?: - a non-capturing group:
  - .* - any zero or more chars other than line break chars as many as possible
  - \b - a word boundary
  - (?=[MDCLXVI]) - require at least one Roman digit immediately to the right
  - M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) - a Roman number pattern
  - \b - a word boundary
  - (?= +\S) - a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
- )? - end of the non-capturing group, repeat one or zero times (it is optional)
) - end of the first group
a space followed by + - one or more spaces
(.*) - Group 2: the rest of the line.
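One detail in the breakdown above is easy to miss: every part of the Roman number pattern is optional, so without the (?=[MDCLXVI]) lookahead it would also match an empty string. The following is a minimal sketch of that point, not part of the original answer; the class name RomanCheck and the helper isRoman are made up for illustration.

```java
import java.util.regex.Pattern;

class RomanCheck {
    // Roman numeral body copied from the answer, without the lookahead.
    static final String BODY =
        "M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // Every part of BODY is optional, so on its own it also matches "".
    static final Pattern BARE = Pattern.compile(BODY);

    // The lookahead demands at least one Roman digit, which rules out the empty match.
    static final Pattern GUARDED = Pattern.compile("(?=[MDCLXVI])" + BODY);

    // Hypothetical helper: true if the whole string is a Roman numeral.
    static boolean isRoman(String s) {
        return GUARDED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"III", "XIV", "MCMXCIV", "IIII", "Rouault", ""}) {
            System.out.println("\"" + s + "\" -> " + isRoman(s));
        }
        System.out.println("bare pattern matches empty string: "
            + BARE.matcher("").matches());
    }
}
```

In the full regex the empty match is additionally blocked by the (?= +\S) lookahead, but the guard keeps the Roman sub-pattern safe to reuse on its own.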

In Java:

String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
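To check the pattern against the sample data from the question, a small test harness along these lines may help. It is only a sketch: the class name RomanSplit and the split helper are invented for this example, and the \h variant of the full Roman pattern is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RomanSplit {
    // Roman numeral sub-pattern from the answer, including its lookahead guard.
    private static final String ROMAN =
        "(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // Full pattern with \h+ (horizontal whitespace) in place of the literal spaces.
    private static final Pattern SPLIT = Pattern.compile(
        "^(\\S+(?:.*\\b" + ROMAN + "\\b(?=\\h+\\S))?)\\h+(.*)");

    // Hypothetical helper: returns {group1, group2}, or null if the line does not match.
    static String[] split(String line) {
        Matcher m = SPLIT.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // \u00e9 is "é", written as an escape so the source file stays ASCII.
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "L\u00e9a Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "L\u00e9a Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

For the third sample, group 1 comes out as "Léa Magdalena III"; for the fifth, where the numeral is the last word, the optional part does not match and group 1 falls back to "Léa".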