I am using the regex [^\\p{L}] and java.util.regex.Matcher#replaceAll(String) to match and remove all non-letter characters from a string. I noticed that for characters encoded as UTF-16 surrogate pairs (code points outside the Basic Multilingual Plane), replaceAll() creates a structurally invalid string (OpenJDK Runtime Environment, build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1).
First, a working example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) {
        // Match every character that is not a Unicode letter.
        Pattern p = Pattern.compile("[^\\p{L}]");
        System.out.println(p.matcher("abcဍ*").replaceAll(""));
    }
}
The above program prints abcဍ as expected (ဍ is MYANMAR LETTER DDA).
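One way to make "structurally valid" checkable is to scan the result for unpaired surrogates. The following is only a minimal sketch: hasUnpairedSurrogate and the class name SurrogateCheck are hypothetical names introduced for illustration and are not part of the JDK. It relies only on Character.isHighSurrogate and Character.isLowSurrogate.

```java
import java.util.regex.Pattern;

class SurrogateCheck {
    // Hypothetical helper (not a JDK method): returns true if s contains a
    // high surrogate that is not followed by a low surrogate, or a low
    // surrogate that is not preceded by a high surrogate.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate without its partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[^\\p{L}]");
        // "\u100D" is MYANMAR LETTER DDA, the same input as the example above.
        String result = p.matcher("abc\u100D*").replaceAll("");
        System.out.println(hasUnpairedSurrogate(result)); // prints false
    }
}
```

Since MYANMAR LETTER DDA (U+100D) lies in the Basic Multilingual Plane, the result contains no surrogates at all and the check reports false.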
Now let's test the character "