Removing non-letter Characters from String leaves UTF-16 High Surrogates

Time:03-03

I am using the regex [^\\p{L}] with java.util.regex.Matcher#replaceAll(String) to match and remove all non-letter characters from a string. I noticed that for characters encoded as UTF-16 surrogate pairs, replaceAll() produces a structurally invalid string (OpenJDK Runtime Environment, build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1).

First a working example:

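import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {
  public static void main(String[] args) {
    // Match every character that is NOT a Unicode letter (general category L).
    Pattern p = Pattern.compile("[^\\p{L}]");
    // Remove all matches; only the letters should remain.
    System.out.println(p.matcher("abcဍ*").replaceAll(""));
  }
}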

The above program prints abcဍ as expected (ဍ is MYANMAR LETTER DDA).
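To confirm whether a result is structurally valid UTF-16, rather than relying on how the terminal renders it, a check along the following lines may help. This is only a sketch: the class name SurrogateCheck and the helper hasUnpairedSurrogate are hypothetical names introduced here, not part of the JDK. It reuses the working input from above, written with the escape \u100D for ဍ so that the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class SurrogateCheck {
  // Hypothetical helper (not a JDK method): returns true if s contains a
  // high surrogate without a following low surrogate, or a stray low surrogate.
  static boolean hasUnpairedSurrogate(String s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // well-formed pair: skip its low half
        } else {
          return true; // high surrogate with no partner
        }
      } else if (Character.isLowSurrogate(c)) {
        return true; // low surrogate not preceded by a high surrogate
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("[^\\p{L}]");
    // Same input as the working example; \u100D is MYANMAR LETTER DDA.
    String result = p.matcher("abc\u100D*").replaceAll("");
    System.out.println(hasUnpairedSurrogate(result));
    // Dump each code point so that a lone surrogate would be visible as U+D8xx..U+DFxx.
    result.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
  }
}
```

For the working input, the result contains only BMP letters, so the check reports false. Running the same check on the output for a supplementary character would show directly whether a lone high surrogate was left behind.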

Now let's test the character "
