Regex for finding special entries in a multiline setting

Time:12-08

I am looking for certain entries with special words in a string. The string looks like this.

entry 1: hello
entry 2: world
entry 3: this
is a multiline
that makes it hard
entry 4: here we have a special entry
entry 5: here
we
have 
another special entry
in a multiline
entry 6: end

Because this is a multiline problem, I use Java's DOTALL flag so that . also matches newline characters.

I am looking for entries that contain the word special.

First I tried to find a regex that captures a full entry: entry \d+: .*?(?=\s*(entry \d+: )|\Z). That is like a simplified version of this

Then I thought I just had to replace the .*? with the regex I need to find. But entry \d+: .*?special.*?(?=\s*(entry \d+: )|\Z) does not work, probably because the special interferes with the lazy matching of .*?, which can then run across entry boundaries.
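A small Java check makes the failure visible. This is only a sketch: the class name NaiveAttempt, the helper firstMatch, and the shortened sample text are made up for illustration, and the pattern is written with \d+: so that it fits headers such as "entry 1: ".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveAttempt {
    // The second pattern from above, written with "\\d+: " to fit the sample headers.
    static final Pattern NAIVE = Pattern.compile(
            "entry \\d+: .*?special.*?(?=\\s*(entry \\d+: )|\\Z)", Pattern.DOTALL);

    // Returns the first match, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = NAIVE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "entry 1: hello\nentry 2: world\nentry 3: no keyword\n"
                + "entry 4: a special entry\nentry 5: end";
        // The match starts at "entry 1: " and only stops after entry 4,
        // because the lazy .*? is still allowed to cross entry headers.
        System.out.println(firstMatch(text));
    }
}
```

The match begins at the first header and extends up to the first entry that contains special, so entries 1 to 4 come back as a single match.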

Does anyone know a better solution?

CodePudding user response:

You can use a tempered greedy token:

(?s)entry \d+: (?:(?!entry \d+: ).)*special.*?(?=\s*entry \d+: |$)

See the regex demo. Details:

  • entry \d+: - the word entry, a space, one or more digits, a colon, and a space
  • (?:(?!entry \d+: ).)* - any char, repeated zero or more times, that does not start an entry header (entry, space, one or more digits, colon, space)
  • special - a fixed string
  • .*? - any zero or more chars, as few as possible
  • (?=\s*entry \d+: |$) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespace chars and an entry header (entry, space, one or more digits, colon, space), or by the end of the string.
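As a rough illustration, the pattern can be tried in Java against the sample from the question. The class name SpecialEntries and the method findSpecialEntries are invented for this sketch, and the header is assumed to have the shape "entry <digits>: " as in the sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialEntries {
    // Tempered greedy token: the dot may not consume the start of another entry header.
    // Header shape "entry <digits>: " is an assumption taken from the sample input.
    static final Pattern SPECIAL_ENTRY = Pattern.compile(
            "(?s)entry \\d+: (?:(?!entry \\d+: ).)*special.*?(?=\\s*entry \\d+: |$)");

    // Collects every entry whose body contains the word "special".
    static List<String> findSpecialEntries(String text) {
        List<String> entries = new ArrayList<>();
        Matcher m = SPECIAL_ENTRY.matcher(text);
        while (m.find()) {
            entries.add(m.group());
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "entry 1: hello",
                "entry 2: world",
                "entry 3: this",
                "is a multiline",
                "that makes it hard",
                "entry 4: here we have a special entry",
                "entry 5: here",
                "we",
                "have ",
                "another special entry",
                "in a multiline",
                "entry 6: end");
        // Brackets make the boundaries of each multiline match visible.
        findSpecialEntries(text).forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

On this input only entries 4 and 5 are returned, and entry 5 keeps its trailing "in a multiline" line.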

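NOTE: Do not use Pattern.MULTILINE with this regex, because that flag makes $ match at the end of every line. Alternatively, keep using \Z (end of the string, or the position right before a final trailing line break), which is not affected by that flag.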
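The effect of the flag can be seen with a short comparison. This is a sketch only: EndAnchorDemo, PREFIX, and firstMatch are hypothetical names, the two-entry text is made up, and the header is assumed to look like "entry <digits>: ".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchorDemo {
    // Everything up to the end anchor; the anchor itself is appended in firstMatch.
    static final String PREFIX =
            "entry \\d+: (?:(?!entry \\d+: ).)*special.*?(?=\\s*entry \\d+: |";

    // Compiles the pattern with the given end anchor and flags, returns the first match.
    static String firstMatch(String endAnchor, int flags, String text) {
        Matcher m = Pattern.compile(PREFIX + endAnchor + ")", flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "entry 1: a\nspecial entry\nwith a tail\nentry 2: end";
        int both = Pattern.DOTALL | Pattern.MULTILINE;
        // With MULTILINE, $ matches before every line break, so the match is cut short.
        System.out.println("[" + firstMatch("$", both, text) + "]");
        // \Z ignores MULTILINE, so the whole entry is returned.
        System.out.println("[" + firstMatch("\\Z", both, text) + "]");
    }
}
```

With both flags set, the $ variant stops at the end of the line that contains special and drops "with a tail", while the \Z variant returns the complete entry.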

CodePudding user response:

If you use word and whitespace character classes instead of dots, it seems to work:

/entry \d+: [\w\s]*special[\w\s]*?(?=\s*(?:entry \d+:)|$)/gm

It seems that allowing the colon : inside the entry text is what breaks the expression.

Also, you have \Z in your expression, but it seems to me that the end-of-line anchor $ is better suited here.

CodePudding user response:

[Edit:] Unfortunately I missed the multiline nature of the entries, so this answer is valid for single-line entries but returns only the first line of a multiline entry. I think one could overcome this by using a suitable regex as the delimiter, though.

I'd suggest you use a Scanner to deal with the multiline aspect. This gives you a stream of tokens (the lines). You can then use String.contains(...) or String.matches(...) to filter the tokens.

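import java.util.Scanner;
import java.util.stream.Collectors;

// The delimiter has to be set on the Scanner before tokens() is called.
var result = new Scanner(myMultiLineString).useDelimiter("\\n")
                                           .tokens()
                                           // alternatively use String.contains(...)
                                           // if you're looking for a constant
                                           // rather than a complex rule.
                                           .filter(s -> s.matches(regex))
                                           .collect(Collectors.toList());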
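The delimiter idea mentioned in the edit could look roughly like this: split at a line break only when the next line starts a new entry, so that each token is a whole entry. This is a sketch, not a tested production solution. The names ScannerEntries and entriesContaining are invented, the header is assumed to be "entry <digits>: " as in the question's sample, and Scanner.tokens() needs Java 9 or later.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

class ScannerEntries {
    // Returns all entries (possibly spanning several lines) that contain the given word.
    static List<String> entriesContaining(String text, String word) {
        // Delimiter: a line break that is followed by an entry header.
        // The lookahead keeps the header itself inside the next token.
        try (Scanner scanner = new Scanner(text).useDelimiter("\\R(?=entry \\d+: )")) {
            return scanner.tokens()
                    .filter(entry -> entry.contains(word))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "entry 1: hello",
                "entry 3: this",
                "is a multiline",
                "entry 4: here we have a special entry",
                "entry 5: here",
                "another special entry",
                "in a multiline",
                "entry 6: end");
        entriesContaining(text, "special").forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

Because the filter uses String.contains(...), no DOTALL handling is needed for the multiline tokens; a regex filter would have to account for the embedded line breaks.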