I have raw text that looks like this:
#John
age: 25
skill: boxer
#Peter
age: 25
skill: fisher
#James
age: 25
skill: bouncer
I intend to separate the blocks and put each one in an array.
My problem is how to get a match using a regex that says "get all matching text that starts with '#' and ends with '#'".
My goal is to fetch John's block separately from Peter's block and James's block.
If I use this:
String regex = "#(.*)";
List<String> matches = Pattern.compile(regex, Pattern.MULTILINE)
        .matcher(raw)
        .results()
        .map(MatchResult::group)
        .collect(Collectors.toList());
The array only contains:
index 0: #John
index 1: #Peter
index 2: #James
which is incomplete because it does not include the 'age' and 'skill' part of the body. My desired outcome is this:
index 0: #John
age: 25
skill: boxer
index 1: #Peter
age: 25
skill: fisher
index 2: #James
age: 25
skill: bouncer
Can you please help?
Answer:
Using Pattern and Matcher directly, we can find every block with a regex "find all" loop:
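import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "#John\nage: 25\nskill: boxer\n\n#Peter\nage: 25\nskill: fisher\n\n#James\nage: 25\nskill: bouncer";
List<String> items = new ArrayList<>();
String pattern = "(?s)(#.*?)\\s*(?=#|$)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
int index = 0;
while (m.find()) {
    items.add(m.group(1));
    // Print each captured block together with its position in the list
    System.out.println("index " + index + ": " + m.group(1));
    index++;
}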
This prints:
index 0: #John
age: 25
skill: boxer
index 1: #Peter
age: 25
skill: fisher
index 2: #James
age: 25
skill: bouncer
The regex pattern used says to match:
(?s) enable dot all mode, so dot matches across newlines
( capture what follows
# match a starting #
.*? then lazily match all content, stopping at the nearest point where the rest of the pattern matches
) end capture
\\s* optional whitespace
(?=#|$) followed by either the next # or end of the input
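
For comparison with the stream-based attempt in the question, the same pattern can be used with Matcher.results() (Java 9+), taking group 1 instead of the whole match. This is only a minimal sketch: the class name BlockSplitter and the helper splitBlocks are hypothetical names chosen for illustration, and the sample input assumes no blank lines between blocks.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class BlockSplitter {
    // Same pattern as in the answer: lazily capture from '#' up to the next '#' or the end of input.
    private static final Pattern BLOCK = Pattern.compile("(?s)(#.*?)\\s*(?=#|$)");

    // Hypothetical helper: returns one string per '#' block, in order of appearance.
    static List<String> splitBlocks(String raw) {
        return BLOCK.matcher(raw)
                .results()                // requires Java 9 or later
                .map(r -> r.group(1))     // group 1 leaves out the trailing whitespace
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String raw = "#John\nage: 25\nskill: boxer\n#Peter\nage: 25\nskill: fisher\n#James\nage: 25\nskill: bouncer";
        List<String> blocks = splitBlocks(raw);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("index " + i + ": " + blocks.get(i));
        }
    }
}
```

Note that the lookahead is not anchored to the start of a line, so a '#' appearing inside a block's body would also end that block.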