First of all, I want to remove all punctuation from a String. I wrote the following code:
Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));
After the replacement I got the output: （hello）
So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, which agrees with the official docs: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
But I also want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), so I changed my code to this:
Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));
After the replacement I got the output: $+<=>^`|~
The matcher did match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), but it missed $+<=>^`|~
I am confused about why that happened. Can anyone help? Thanks in advance!
CodePudding user response:
Unicode (what you get when you use (?U)) and POSIX (what you get without (?U)) disagree on what counts as punctuation.
When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:
$+<=>^`|~
For example, the Unicode category for $ is "Symbol, Currency", or Sc. See here.
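To see exactly which ASCII characters the two definitions disagree on, a small check along these lines may help. It is only a sketch: the class name PunctDiff and the method posixOnly are made up for illustration, and it compares \p{Punct} (without (?U)) against the Unicode category \p{P} directly.

```java
import java.util.regex.Pattern;

final class PunctDiff {
    // POSIX punctuation class, as used when (?U) is not set
    private static final Pattern POSIX_PUNCT = Pattern.compile("\\p{Punct}");
    // Unicode general category P (Punctuation)
    private static final Pattern UNICODE_PUNCT = Pattern.compile("\\p{P}");

    // Collects the printable ASCII characters that are POSIX punctuation
    // but do not belong to the Unicode Punctuation category.
    static String posixOnly() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0x21; c <= 0x7E; c++) {
            String s = String.valueOf(c);
            if (POSIX_PUNCT.matcher(s).matches() && !UNICODE_PUNCT.matcher(s).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(posixOnly()); // prints $+<=>^`|~
        // Character.CURRENCY_SYMBOL is the constant for category Sc
        System.out.println(Character.getType('$') == Character.CURRENCY_SYMBOL); // prints true
    }
}
```

Character.getType can be used the same way to look up the category of any other character that is left behind.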
If you want to match $+<=>^`|~ plus all the Unicode punctuation, you can put them together in one character class. You can also use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).
Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// there is no need to call find() first; replaceAll scans the whole input itself
System.out.println(matcher.replaceAll(""));
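If the same cleanup is needed in several places, the pattern could be compiled once and wrapped in a helper, roughly as below. This is a sketch rather than a definitive implementation: PunctuationStripper and strip are hypothetical names, and the fullwidth parentheses are written as \uFF08 and \uFF09 escapes so the source does not depend on the file encoding.

```java
import java.util.regex.Pattern;

final class PunctuationStripper {
    // Unicode category P plus the nine ASCII symbols that only POSIX counts as punctuation
    private static final Pattern PUNCT = Pattern.compile("[\\p{P}$+<=>^`|~]");

    // Returns the input with every matching character removed.
    static String strip(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";
        System.out.println(strip(input)); // prints hello
    }
}
```

Compiling the pattern into a static field avoids recompiling it on every call, which is what String.replaceAll would do.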