I am looking for certain entries with special words in a string. The string looks like this.
entry 1: hello
entry 2: world
entry 3: this
is a multiline
that makes it hard
entry 4: here we have a special entry
entry 5: here
we
have
another special entry
in a multiline
entry 6: end
Because it is a multiline problem, I use Java's DOTALL flag so that . also matches newline characters.
I am looking for entries that contain the word special.
First I tried to find a regex that captures a full entry: entry \d+: .*?(?=\s*(entry \d+: )|\Z). That is like a simplified version of this.
Then I thought, OK, I just have to exchange the .*? for the regex I need to find. But entry \d+: .*?special.*?(?=\s*(entry \d+: )|\Z) does not work, probably because the special breaks the non-greedy matching of the expression.
Does anyone know a better solution?
CodePudding user response:
You can use a tempered greedy token:
(?s)entry \d+: (?:(?!entry \d+: ).)*special.*?(?=\s*entry \d+: |$)
See the regex demo. Details:
entry \d+:  - entry, a space, one or more digits, : and a space
(?:(?!entry \d+: ).)* - any char, repeated zero or more times, that does not start the entry, space, one or more digits, :, space sequence
special - a fixed string
.*? - any zero or more chars, as few as possible
(?=\s*entry \d+: |$) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespaces, entry, a space, one or more digits, : and a space, or by the end of the string.
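NOTE: Do not use Pattern.MULTILINE with this regex, because $ would then also match at the end of every line and the lazy .*? would stop at the first line break. Or, keep on using \Z (end of the string, or the position right before a trailing line terminator).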
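A minimal Java sketch of the tempered greedy token, next to the naive lazy-dot pattern for comparison. It assumes entry headers look exactly like the sample (entry, a space, digits, a colon, a space); the class name SpecialEntries, the helper findEntries and the constant names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialEntries {
    // Sample input from the question.
    static final String SAMPLE = String.join("\n",
        "entry 1: hello",
        "entry 2: world",
        "entry 3: this",
        "is a multiline",
        "that makes it hard",
        "entry 4: here we have a special entry",
        "entry 5: here",
        "we",
        "have",
        "another special entry",
        "in a multiline",
        "entry 6: end");

    // Naive attempt: the lazy .*? is free to run across entry headers
    // until it finds "special".
    static final Pattern NAIVE = Pattern.compile(
        "entry \\d+: .*?special.*?(?=\\s*entry \\d+: |\\Z)",
        Pattern.DOTALL);

    // Tempered greedy token: each char is consumed only if no entry header
    // starts at that position, so a match cannot leave its own entry.
    static final Pattern TEMPERED = Pattern.compile(
        "entry \\d+: (?:(?!entry \\d+: ).)*special.*?(?=\\s*entry \\d+: |\\Z)",
        Pattern.DOTALL);

    // Collects every match of the given pattern in the text.
    static List<String> findEntries(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("naive:");
        findEntries(NAIVE, SAMPLE).forEach(e -> System.out.println("[" + e + "]"));
        System.out.println("tempered:");
        findEntries(TEMPERED, SAMPLE).forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

On the sample, the tempered pattern returns entries 4 and 5 only, while the first match of the naive pattern already starts at entry 1 and runs through entry 4. The sketch uses \Z rather than $, so it behaves the same even if Pattern.MULTILINE were added later.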
CodePudding user response:
If you use word and whitespace character classes instead of dots, then it seems to work:
/entry \d+: [\w\s]*special[\w\s]*?(?=\s*(?:entry \d+:)|$)/gm
It seems that if you allow the colon : in your text, it breaks the expression.
Also, you have \Z in your expression, but it seems to me that the end-of-line anchor $ is more suited here.
CodePudding user response:
[Edit:] I unfortunately missed the multiline nature of entries, so this answer is valid for single line entries but will return only the first line for multiline entries. I think one could overcome this by setting a certain regex for delimiter, though.
I'd suggest you use a Scanner to deal with the multiline aspect. This will give you a stream of tokens (the lines). You can then use String.contains(...) or String.matches(...) to filter the tokens.
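import java.util.Scanner;
import java.util.stream.Collectors;

var result = new Scanner(myMultiLineString)
    // set the delimiter first: tokens() returns a Stream<String>
    .useDelimiter("\\n")
    .tokens()
    // alternatively use String.contains(...)
    // if you're looking for a constant
    // rather than a complex rule.
    .filter(s -> s.matches(regex))
    .collect(Collectors.toList());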
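As a rough sketch of the delimiter idea mentioned in the edit note: if the Scanner splits only at line breaks that are directly followed by an entry header, each token is a whole entry, multiline or not. The delimiter regex, the class name ScannerEntries and the helper entriesContaining are assumptions for illustration, and the header format is taken from the sample string.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

class ScannerEntries {
    // Hypothetical helper: returns the entries that contain the given word.
    static List<String> entriesContaining(String text, String word) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner
                // split only at a line break that is followed by "entry <digits>: "
                .useDelimiter("\\R(?=entry \\d+: )")
                .tokens()
                // a plain substring check is enough for a constant word
                .filter(entry -> entry.contains(word))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "entry 1: hello",
            "entry 2: world",
            "entry 3: this",
            "is a multiline",
            "that makes it hard",
            "entry 4: here we have a special entry",
            "entry 5: here",
            "we",
            "have",
            "another special entry",
            "in a multiline",
            "entry 6: end");
        entriesContaining(text, "special")
            .forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

This keeps the filtering step as simple as in the line-based version; only the delimiter changes. Scanner.tokens() requires Java 9 or later.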