The regex (?U)\p{Punct} misses some Unicode punctuation in Java

Time:09-26

First of all, I want to remove all punctuation from a String. I wrote the following code:

Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After the replacement I got the output: （hello）

So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ which is in accord with the official docs: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

But I also want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), so I changed my code to this:

Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After the replacement I got the output: $+<=>^`|~

The matcher does indeed match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09).

But it misses $+<=>^`|~

I am confused about why this happens. Can anyone help? Thanks in advance!

CodePudding user response:

Unicode (that is, when you use (?U)) and POSIX (when you don't use (?U)) disagree on what counts as punctuation.

When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:

$+<=>^`|~
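
To see the difference between the two modes side by side, here is a minimal sketch. The class name PunctModes and the helper leftover are made up for illustration; Pattern.UNICODE_CHARACTER_CLASS is the flag that the inline (?U) switches on.

```java
import java.util.regex.Pattern;

class PunctModes {
    // The 32 ASCII characters that POSIX-mode \p{Punct} matches.
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Default (POSIX) mode versus Unicode mode, i.e. the same as writing (?U) inline.
    static final Pattern POSIX = Pattern.compile("\\p{Punct}");
    static final Pattern UNICODE =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: returns what is left after deleting every match.
    static String leftover(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uFF08 and \uFF09 are the fullwidth left and right parentheses.
        String input = ASCII_PUNCT + "\uFF08\uFF09";
        // POSIX mode keeps only the two fullwidth parentheses.
        System.out.println(leftover(POSIX, input));
        // Unicode mode keeps the ASCII symbols: $+<=>^`|~
        System.out.println(leftover(UNICODE, input));
    }
}
```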

For example, the Unicode category for $ is "Symbol, Currency", or Sc. Likewise, +, <, =, >, | and ~ are "Symbol, Math" (Sm), and ^ and ` are "Symbol, Modifier" (Sk).
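
If you want to check the category of a character yourself, Character.getType reports it. The sketch below is only illustrative: the class name PunctCategories and the helper label are invented here, and the switch covers just the categories relevant to this question.

```java
class PunctCategories {
    // Hypothetical helper: maps a code point to the short name of its
    // Unicode general category (only the symbol and punctuation ones).
    static String label(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CURRENCY_SYMBOL:       return "Sc";
            case Character.MATH_SYMBOL:           return "Sm";
            case Character.MODIFIER_SYMBOL:       return "Sk";
            case Character.OTHER_SYMBOL:          return "So";
            case Character.START_PUNCTUATION:     return "Ps";
            case Character.END_PUNCTUATION:       return "Pe";
            case Character.DASH_PUNCTUATION:      return "Pd";
            case Character.CONNECTOR_PUNCTUATION: return "Pc";
            case Character.OTHER_PUNCTUATION:     return "Po";
            default:                              return "other";
        }
    }

    public static void main(String[] args) {
        // The nine characters that (?U)\p{Punct} skips, plus the fullwidth parentheses.
        String sample = "$+<=>^`|~\uFF08\uFF09";
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            System.out.println(c + " -> " + label(c));
        }
    }
}
```

Anything labelled S* is a symbol rather than punctuation, which is why Unicode mode leaves it alone.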

If you want to match $+<=>^`|~ plus all the Unicode punctuation, you can put both in one character class. You can also just use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).

Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first
System.out.println(matcher.replaceAll(""));
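
For completeness, here is the same idea as a self-contained sketch you can compile and run. The class name PunctStripper and the method strip are names chosen for this example, not anything from the JDK.

```java
import java.util.regex.Pattern;

class PunctStripper {
    // Unicode punctuation (\p{P}) plus the nine ASCII symbols POSIX also counts.
    // Inside a character class $, +, | and a non-leading ^ are literal characters.
    static final Pattern PUNCT = Pattern.compile("[\\p{P}$+<=>^`|~]");

    // Hypothetical helper: removes every matching character from the input.
    static String strip(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uFF08 and \uFF09 are the fullwidth parentheses from the question.
        String input = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";
        System.out.println(strip(input)); // prints: hello
    }
}
```

If you would rather drop every Unicode symbol as well (for example € or ©), [\\p{P}\\p{S}] should be a broader alternative, but that removes more than the POSIX list does, so only use it if that is what you want.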