Home > Software engineering >  Java regular expression find() matching a zero-length string
Java regular expression find() matching a zero-length string

Time:09-03

The Javadoc for java.util.regex.Matcher.find() says:

This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.

My own experiments suggest that if the regex matches a zero length string, then the method starts one character beyond the end of the previous (zero-length) match. For example, given an input string "abcabc" and a regex "a|b?", successive matches will occur at positions (0, 1, 2, 3, 4, 5, 6). Following a successful match at position 6, there is an unsuccessful match at position 6; and further calls also return false with position remaining at 6.

The documentation suggests that after finding a zero-length match at position 2 (character "c"), the next call on find will start at "the first character not matched", which is still at position 2.

The documentation also suggests that after an unsuccessful match at position 6, the next call should start at position 0, which doesn't appear to be the case.

Is the documentation simply wrong?

Is there a more precise rule somewhere, for example one that covers the behaviour at position 6?

My reason for asking is that I am trying to write an emulation of the method in C#, and I'm having trouble reproducing the precise behaviour by trial and error.

CodePudding user response:

This is a special case handled by Matcher.find():

public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex  ;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i  )
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}

If the last match was a zero-length match (indicated by first == last) then it starts the next search at last 1.

If you think about it this makes sense. Without this, if the Matcher once finds a zero-length match it would never proceed over this match.


So yes, the documentation is incomplete: it neither mentions the special case for zero length matches nor that it only ever passes once over the input string.

But IMHO both special cases are implied by "Attempts to find the next subsequence of the input sequence that matches the pattern."

After a zero length match the next possible subsequence must start one position after the previous match, and the next subsequence can never start at a position lower than the current position (i.e. start over from the beginning).

CodePudding user response:

You can find the OpenJDK implementation of the find method here.

public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex  ;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i  )
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}

first and last store the range of the previous successful match, and from and to store the range of the range of the string to be matched.

/**
 * The range within the sequence that is to be matched. Anchors
 * will match at these "hard" boundaries. Changing the region
 * changes these values.
 */
int from, to;

/**
 * The range of string that last matched the pattern. If the last
 * match failed then first is -1; last initially holds 0 then it
 * holds the index of the end of the last match (which is where the
 * next search starts).
 */
int first = -1, last = 0;

The behaviours that are relevant to your question are the first and third if statements.

As you can clearly see, the first if checks if the previous match was a 0-length match, and if it is, moves nextSearchIndex one char forward.

The third if checks if nextSearchIndex goes out of range of the string. This happens after you found all 7 matches of "a|b?" in "abcabc".

From what I can see, the documentation doesn't suggest that "after an unsuccessful match at position 6, the next call should start at position 0". At best, the documentation doesn't say anything about what happens after an unsuccessful match.

It could also be argued that "a previous invocation of the method" refers to an invocation of the method at any previous time, not just the immediately prior invocation, in which case what happens after an unsuccessful match is very well-defined:

  • if there were no previous successful matches at all, start at the beginning
  • if there was a previous successful match, and it doesn't matter how many failed matches you have since that match, start at the next character not matched by that match.
  • Related