Java REGEX: include new line and place results in an array

I have a raw text that looks like this:

#John
age: 25
skill: boxer

#Peter
age: 25
skill: fisher

#James
age: 25
skill: bouncer

I intend to separate each block and put each in an array.

My problem is how to write a regex that says "get all text that starts with '#' and runs up to the next '#'".

The goal is to fetch John's block separately from Peter's block and James's block.

If I use this:

String regex = "#(.*)";
List<String> matches = Pattern.compile(regex, Pattern.MULTILINE)
                    .matcher(raw)
                    .results()
                    .map(MatchResult::group)
                    .collect(Collectors.toList());

The array only contains:

index 0: #John
index 1: #Peter
index 2: #James

which is incomplete because it does not include the 'age' and 'skill' part of the body. My desired outcome is this:

index 0: #John
         age: 25
         skill: boxer

index 1: #Peter
         age: 25
         skill: fisher

index 2: #James
         age: 25
         skill: bouncer

Can you please help?

CodePudding user response:

With an explicit Pattern and Matcher, we can collect every block in a find loop:

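import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "#John\nage: 25\nskill: boxer\n\n#Peter\nage: 25\nskill: fisher\n\n#James\nage: 25\nskill: bouncer";
List<String> items = new ArrayList<>();
String pattern = "(?s)(#.*?)\\s*(?=#|$)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
int index = 0;
while (m.find()) {
    // group(1) is the block itself, without the trailing whitespace
    items.add(m.group(1));
    System.out.println("index " + index++ + ": " + m.group(1));
}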

This prints:

index 0: #John
age: 25
skill: boxer
index 1: #Peter
age: 25
skill: fisher
index 2: #James
age: 25
skill: bouncer

The regex pattern used says to match:

(?s)             enable dot-all mode, so the dot also matches newlines
(                start capture group 1
#                match a starting #
.*?              then lazily match as little content as possible
)                end capture group 1
\s*              optional trailing whitespace (written \\s* in the Java string)
(?=#|$)          lookahead: what follows is either the next # or the end of the input
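
The same pattern should also fit the stream style used in the question. Here is a minimal sketch, assuming Java 9 or later (needed for Matcher.results()); the class name BlockSplitter and the helper splitBlocks are made up for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BlockSplitter {
    // Same pattern as in the loop version: group 1 holds one '#' block
    // without its trailing whitespace.
    private static final Pattern BLOCK = Pattern.compile("(?s)(#.*?)\\s*(?=#|$)");

    // Hypothetical helper: returns each '#'-headed block as one list element.
    static List<String> splitBlocks(String raw) {
        return BLOCK.matcher(raw)
                .results()                // Stream<MatchResult>, Java 9+
                .map(r -> r.group(1))     // keep only the captured block
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String raw = "#John\nage: 25\nskill: boxer\n\n"
                + "#Peter\nage: 25\nskill: fisher\n\n"
                + "#James\nage: 25\nskill: bouncer";
        List<String> blocks = splitBlocks(raw);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("index " + i + ": " + blocks.get(i));
        }
    }
}
```

One caveat with this pattern: a '#' anywhere inside a block's body also ends that block, because the lookahead stops at the first '#' it finds.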