Why is find() with Regex ^[a-z]+$ not equivalent to matches() with Regex [a-z]+?

Java's Matcher is the engine that performs match operations on a character sequence by interpreting a Pattern (regular expression). This class has two well-known operations:

Matcher.find() which scans the input sequence looking for the next subsequence that matches the pattern.
Matcher.matches() which attempts to match the entire input sequence against the pattern.
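
As a minimal sketch of the difference between the two operations (the class name FindVsMatches and its two helper methods are made up for this illustration, they are not part of the JDK):

```java
import java.util.regex.Pattern;

class FindVsMatches {
    // True if some substring of the input matches the regex.
    static boolean findsSomewhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "abc" occurs inside the input, so find() succeeds ...
        System.out.println(findsSomewhere("[a-z]+", "123abc456")); // true
        // ... but the digits are not covered by the pattern, so matches() fails.
        System.out.println(matchesWhole("[a-z]+", "123abc456"));   // false
        System.out.println(matchesWhole("[a-z]+", "abc"));         // true
    }
}
```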

In other words, find() should be used to match a substring whereas matches() should be used to match the entire input. This got me thinking that using find() with a Regex like ^[a-z]+$ is equivalent to using matches() with a Regex like [a-z]+, so I went ahead and tested that.


import java.util.List;
import java.util.regex.Pattern;

public class Main
{
    public static void main(String[] args) {
        Pattern sub = Pattern.compile("[a-z]+");
        Pattern all = Pattern.compile("^[a-z]+$");
        List<String> tests = List.of("", "  ", "a", "A", "abc", "a\r", 
                                     "a\r\n", "a\n", " a", "\na", "\ra\n", 
                                     "\r\na", "\na");
        for (String test : tests) {
            boolean matchesSub = sub.matcher(test).matches();
            boolean matchesAll = all.matcher(test).find();
            System.out.printf("%s\t%s\t%s", format(test), matchesSub, matchesAll);
            System.out.println();
        }
    }

    private static String format(String input) {
        return input.replace("\r", "\\r").replace("\n", "\\n");
    }
}

Which produced the following output:

        false   false
        false   false
a       true    true
A       false   false
abc     true    true
a\r     false   true
a\r\n   false   true
a\n     false   true
 a      false   false
\na     false   false
\ra\n   false   false
\r\na   false   false
\na     false   false

Interestingly enough, this test fails for a\r, a\r\n and a\n:

using matches() with [a-z]+ on these cases produces false. Apparently the line break at the end is counted as a character, failing the test.
using find() with ^[a-z]+$ on these cases produces true. Apparently the line break at the end is ignored, passing the test.

This only holds true when the line break is at the end, not at the beginning though, as \r\na is treated the same by both methods.

What's going on?

CodePudding user response：

^ and $ mean different things depending on which mode you're running your regexp in. See the Pattern.MULTILINE flag's javadoc.

In any case, ^ and $ never consume anything.

The way regex engines work, is that everything in the regexp can 'match' or 'not match' and usually as part of matching, they also consume characters.

You can think of it as a cursor that, just like your text cursor, always sits in between characters. The regexp engine goes from left to right through your regexp, starting with the cursor at the beginning of the input; each item in the regexp pattern either matches or fails, and usually, but not always, moves the cursor forward.

^ and $ can match or fail, but they cannot move the cursor. It's the same as e.g. \b (matches on a 'word break'), or (positive/negative) look-(ahead/behind) in that way. The relevant trickery here is that for the matches() case, every character must be consumed - the matching process must end such that the cursor is at the very end. Your pattern can only consume lowercase letters (only forward the cursor when there are lowercase letters), so the moment you toss any character in your string that isn't one of those (so even one \r or \n, in any position), it couldn't possibly match; there is no way to consume these non-lowercase characters.
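
To make the 'match without moving the cursor' idea concrete, here is a small sketch that measures how many characters a match consumed. The class ZeroWidthDemo and the helper firstMatchLength are hypothetical names for this illustration; Matcher.start() and Matcher.end() are the standard methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Number of characters consumed by the first match, or -1 if nothing matches.
    static int firstMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // Anchors and \b match, but the cursor does not move: length 0.
        System.out.println(firstMatchLength("^", "abc"));      // 0
        System.out.println(firstMatchLength("$", "abc"));      // 0
        System.out.println(firstMatchLength("\\b", "abc"));    // 0
        // A character class with a quantifier consumes what it matches.
        System.out.println(firstMatchLength("[a-z]+", "abc")); // 3
        // matches() needs the cursor to reach the very end; nothing can consume the \n.
        System.out.println(Pattern.compile("[a-z]+").matcher("abc\n").matches()); // false
    }
}
```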

With find(), on the other hand, you don't need to consume all characters; you merely need for a substring to match up, that is all.

Which then gets us to: which 'states' in the string are considered as 'matching' the ^ state, and which ones are considered as 'matching' the $ state? The answer is partly dependent on whether MULTILINE mode is on. It's off in your code snippet; you can turn it on by making your regexes using Pattern.compile(patternString, Pattern.MULTILINE), or by tossing (?m) inside your regexp string. (A (?xyz) group enables/disables flags from the point where it shows up in your pattern string, and has no effect otherwise: it always matches and consumes nothing, which is regexp-engine-ese for 'doesn't do anything whatsoever'.)

Even UNIX_LINES has an effect on this: with UNIX_LINES mode on, only \n is considered a line terminator, and in MULTILINE mode ^/$ will match whenever you're at a line terminator.
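
A quick way to see the UNIX_LINES effect, sketched with a made-up helper (UnixLinesDemo and findsHello are illustrative names only; Pattern.UNIX_LINES is the real flag):

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Runs find() with the pattern hello$ compiled under the given flags.
    static boolean findsHello(String input, int flags) {
        return Pattern.compile("hello$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Default: a trailing \r counts as a line terminator, so $ matches before it.
        System.out.println(findsHello("hello\r", 0));                  // true
        // UNIX_LINES: \r is just an ordinary character, so $ no longer matches there.
        System.out.println(findsHello("hello\r", Pattern.UNIX_LINES)); // false
        // \n is still a line terminator in UNIX_LINES mode.
        System.out.println(findsHello("hello\n", Pattern.UNIX_LINES)); // true
    }
}
```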

In multiline mode, all of your examples that contain a line break would match: ^ is 'true' whenever the cursor is at start-of-input (the cursor is always in between characters; if it sits before the first character, it is considered to match), or whenever it's in between a newline character and the thing that immediately follows it, as long as that thing isn't the end of the entire input. \r and \n both count (because UNIX_LINES is off).
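
For illustration, a minimal sketch of what switching MULTILINE on does to one of the inputs from the question. MultilineDemo and findsLine are hypothetical names; the pattern with the + quantifier is the one the question's output implies.

```java
import java.util.regex.Pattern;

class MultilineDemo {
    // Runs find() with ^[a-z]+$ compiled under the given flags.
    static boolean findsLine(String input, int flags) {
        return Pattern.compile("^[a-z]+$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Default mode: ^ only matches at the very start, where a \n sits in the way.
        System.out.println(findsLine("\na", 0));                 // false
        // MULTILINE: ^ also matches right after the line break.
        System.out.println(findsLine("\na", Pattern.MULTILINE)); // true
        // The embedded flag (?m) is equivalent to passing Pattern.MULTILINE.
        System.out.println(Pattern.compile("(?m)^[a-z]+$").matcher("\r\na").find()); // true
    }
}
```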

But you're not in MULTILINE mode, so what in the blazes is going on?

What's going on is that the docs are wrong, as @MartinDevillers' excellent digging around for the relevant bug entries shows.

The docs are only slightly wrong. Specifically, the regex engine is trying to be a little more intelligent than this rather rote description from the javadoc of the regular expression package:

By default these expressions only match at the beginning and the end of the entire input sequence.

And that's just plain hogwash. It's more intelligent than that: They also match when your cursor is in between a character and exactly one newline, though any of \r, \n, and \r\n are all considered 'one newline', as long as that one newline is the final thing in the entire input. In other words, given (where every space isn't real; I'm making room to show where cursors can be, which can only be between chars, so I can stick a marker below them to show where things match):

" h e l l o \r \n "
           ^  ^  ^

The matching system considers $ matched in any of the ^ places. Let's test that theory:

Pattern p = Pattern.compile("hello$");
System.out.println(p.matcher("hello\r\n\n").find());
System.out.println(p.matcher("hello\r\n").find());
System.out.println(p.matcher("hello\r").find());
System.out.println(p.matcher("hello\n").find());
System.out.println(p.matcher("hello\n\n").find());

This prints false, true, true, true, false. The middle 3 all have a character (or characters) at the end that are considered 'a single newline' on at least one major OS (\n is POSIX/Unix/Mac OS X, \r\n is Windows, and \r is classic Mac, which I don't think ever ran a JVM and which nobody uses anymore, but it's still considered 'a newline' by most rules, for grandfathering reasons I guess).

That's all you're missing here.

CONCLUSION:

The docs are slightly wrong, and $ is smarter than merely 'matches at very end of input'; it acknowledges that sometimes input has a stray newline hanging off the end of it, and $ won't get confused by this. matches(), however, will get confused by a dangling newline at the very end: it has to consume everything or the input isn't considered matched.

CodePudding user response：

As @WiktorStribiżew answered in his comment, matches() with [a-z]+ is NOT equivalent to find() with ^[a-z]+$; it is, however, equivalent to find() with ^[a-z]+\z (written "^[a-z]+\\z" as a Java string literal). This is because $ treats a single trailing newline as a special case: it ignores it. \z is not so forgiving.
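
A short sketch comparing the three variants on a few of the inputs from the question. EndAnchorDemo and its helper methods are made-up names for this illustration; the patterns use the + quantifier that the question's output implies.

```java
import java.util.List;
import java.util.regex.Pattern;

class EndAnchorDemo {
    static final Pattern WHOLE  = Pattern.compile("[a-z]+");
    static final Pattern DOLLAR = Pattern.compile("^[a-z]+$");
    static final Pattern STRICT = Pattern.compile("^[a-z]+\\z");

    static boolean viaMatches(String s)   { return WHOLE.matcher(s).matches(); }
    static boolean viaDollar(String s)    { return DOLLAR.matcher(s).find(); }
    static boolean viaStrictEnd(String s) { return STRICT.matcher(s).find(); }

    public static void main(String[] args) {
        // Columns: input, matches(), find() with $, find() with \z
        for (String s : List.of("abc", "a\n", "a\r\n", "\na")) {
            String shown = s.replace("\r", "\\r").replace("\n", "\\n");
            System.out.printf("%s\t%s\t%s\t%s%n",
                    shown, viaMatches(s), viaDollar(s), viaStrictEnd(s));
        }
    }
}
```

On these inputs the matches() and \z columns agree in every row, while the $ column differs for a\n and a\r\n.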

This behavior isn't documented clearly in the official Java documentation. Moreover, there's an open bug report in the JDK currently under investigation which specifically deals with the $ matcher, trailing newlines and the find() method. Also, judging by these other, older reports, it's confusing at a minimum: JDK-8218146, JDK-8059325, JDK-8058923, JDK-8049849, JDK-8043255.

Finally, this behavior is not the same in all RegEx implementations:

In all major engines except JavaScript, if the string has one final line break, the $ anchor can match there. For instance, in the string apple\n, e$ matches the final e.