Different Java Regex matching behavior when using UNICODE_CHARACTER

I was testing the behavior of the Pattern.UNICODE_CHARACTER_CLASS flag with different punctuation characters and noticed that the grave accent character (U+0060) ` matches differently depending on whether Pattern.UNICODE_CHARACTER_CLASS is set.

For example, see the code below:


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraveAccentTest {
    public static void main(String[] args) {
        // Without the flag
        Pattern p = Pattern.compile("\\p{Punct}");
        Matcher m = p.matcher("`");
        System.out.println(m.matches()); // prints true

        // With Pattern.UNICODE_CHARACTER_CLASS
        Pattern p1 = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
        Matcher m1 = p1.matcher("`");
        System.out.println(m1.matches()); // prints false
    }
}

Without the Pattern.UNICODE_CHARACTER_CLASS flag, the grave accent matches the \p{Punct} character class, but with the flag it does not. Can someone explain the reasoning for this?

CodePudding user response：

The documentation for UNICODE_CHARACTER_CLASS says:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expressions Annex C: Compatibility Properties.

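In other words, without the flag these classes cover US-ASCII only, and with the flag they follow the Unicode definitions instead. If you check the tables of Unicode punctuation characters, you will see that some ASCII characters, including the grave accent, are not listed as punctuation.

Tables: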

https://www.fileformat.info/info/unicode/category/Po/list.htm

https://www.gaijin.at/en/infos/unicode-character-table-punctuation

CodePudding user response：

When you use Pattern p = Pattern.compile("\\p{Punct}"); without any flags, \p{Punct} refers to the following 32 characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Reference: the Pattern class.

These 32 characters are the ASCII characters in the range 0x21 through 0x7e, excluding letters and digits. They also happen to be all the non-letter and non-digit symbols on my standard U.S. keyboard (your keyboard may be different, of course).

The grave accent (also known as a backtick) is in that list and on my keyboard.

That is a simple example of what the Pattern documentation calls a "POSIX character class" (US-ASCII only) - and explains why your m.matches() returns true.
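As a quick check of the list above, here is a minimal sketch that counts how many of those 32 characters match \p{Punct} when no flag is set. The class name AsciiPunctDemo and the helper countMatches are made up for this illustration; only Pattern and its methods come from the JDK.

```java
import java.util.regex.Pattern;

public class AsciiPunctDemo {
    // The 32 ASCII punctuation characters listed in the Pattern documentation.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Counts how many single characters of s match \p{Punct} under the given flags.
    static int countMatches(String s, int flags) {
        Pattern p = Pattern.compile("\\p{Punct}", flags);
        int count = 0;
        for (char c : s.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(ASCII_PUNCT.length());         // 32
        System.out.println(countMatches(ASCII_PUNCT, 0)); // 32 (no flags set)
    }
}
```

Passing 0 as the flags argument means "no flags", so this exercises the default, ASCII-only behavior.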

When you add the Pattern.UNICODE_CHARACTER_CLASS flag things get more complicated.

As the documentation for this flag explains, it:

Enables the Unicode version of Predefined character classes and POSIX character classes.

and:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expressions Annex C: Compatibility Properties.

Looking at the Annex C referred to above, we find a table showing the "recommended assignments for compatibility property names".

For our property name (punct), the standard recommendation is to use characters defined by this:

\p{gc=Punctuation}

Here, "gc" stands for "general category". Unicode characters are assigned a "general category" value. In this case, that is Punctuation - also abbreviated to P and further broken down into various sub-categories such as Pc for connectors, Pd for dashes, and so on. There is also a catch-all Po for "other punctuation characters".

In Unicode, the grave accent is assigned to the Symbol general category (S), and within that to the Modifier subcategory, abbreviated Sk.

Contrast that with a character such as the ASCII exclamation mark (also part of our original \p{Punct} list, shown above). For that we can see that the general category assignment is Po.
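You can confirm these assignments from Java itself with Character.getType, which returns the general category of a character. The sketch below is only an illustration: the class GeneralCategoryDemo and its category helper are hypothetical names, and the helper maps just the few categories relevant to this question.

```java
public class GeneralCategoryDemo {
    // Translates a few Character.getType constants into Unicode abbreviations.
    // Anything not handled here is reported as "other".
    static String category(char c) {
        switch (Character.getType(c)) {
            case Character.OTHER_PUNCTUATION: return "Po";
            case Character.MODIFIER_SYMBOL:   return "Sk";
            case Character.MATH_SYMBOL:       return "Sm";
            case Character.CURRENCY_SYMBOL:   return "Sc";
            default:                          return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(category('`')); // Sk
        System.out.println(category('!')); // Po
    }
}
```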

That explains why the grave is no longer matched when we add the Pattern.UNICODE_CHARACTER_CLASS flag to our original pattern.

It is assigned to a different general category from the punctuation category we are using in our regex.
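The grave accent is not the only ASCII character affected. As a rough sketch (UnicodePunctDiff and lostWithUnicodeFlag are names invented for this example), the following collects every character from the original 32-character list that \p{Punct} no longer matches once Pattern.UNICODE_CHARACTER_CLASS is set:

```java
import java.util.regex.Pattern;

public class UnicodePunctDiff {
    // The 32 ASCII punctuation characters matched by \p{Punct} without flags.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Returns the ASCII punctuation characters that \p{Punct} does not match
    // when the Unicode version of the class is enabled.
    static String lostWithUnicodeFlag() {
        Pattern unicode = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
        StringBuilder lost = new StringBuilder();
        for (char c : ASCII_PUNCT.toCharArray()) {
            if (!unicode.matcher(String.valueOf(c)).matches()) {
                lost.append(c);
            }
        }
        return lost.toString();
    }

    public static void main(String[] args) {
        // Expected on a standard JDK: $+<=>^`|~
        System.out.println(lostWithUnicodeFlag());
    }
}
```

All of the characters it reports belong to one of the Symbol subcategories (Sc, Sm or Sk) rather than to Punctuation, which is consistent with the explanation above.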

The obvious next question is why the grave character was not included in the Unicode Punctuation general category (for example, as Po). Why is it in Sk instead?

I do not have a good answer for that - I'm sure there are "historical reasons". It's worth noting, however, that the Sk category includes characters such as the acute accent, the cedilla, the diaeresis, and so on - and (as already noted) our grave accent.

All these are diacritics - typically used in combination with a base letter to alter the pronunciation. So maybe that is the underlying reason.
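To see that these characters really do share a category, you can match them against \p{Sk}, which Java's Pattern accepts as a general category name. This is just an illustrative sketch; ModifierSymbolDemo and isModifierSymbol are hypothetical names, and the code points used are U+00B4 (acute accent), U+00B8 (cedilla) and U+00A8 (diaeresis).

```java
import java.util.regex.Pattern;

public class ModifierSymbolDemo {
    // \p{Sk} selects the Unicode general category "Symbol, modifier".
    static final Pattern SK = Pattern.compile("\\p{Sk}");

    // True if the whole string is a single character in category Sk.
    static boolean isModifierSymbol(String s) {
        return SK.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isModifierSymbol("`"));      // true  (grave accent)
        System.out.println(isModifierSymbol("\u00B4")); // true  (acute accent)
        System.out.println(isModifierSymbol("\u00B8")); // true  (cedilla)
        System.out.println(isModifierSymbol("\u00A8")); // true  (diaeresis)
        System.out.println(isModifierSymbol("!"));      // false (category Po)
    }
}
```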

The grave is a bit of an oddity, perhaps, given it has a historical usage outside of being used as a diacritic.

It may be more relevant to ask how the grave ended up as part of the original ASCII character set, in the first place. Some background about this is provided in the Wikipedia page for the backtick.