I have a service that receives free text, such as name, surname, and address, and I want to throw an error if any character sent doesn't belong to the Windows-1252 character set, but I don't know how to do this properly. I was thinking of using a regex, but I'm not sure that is the best option.
The regex would contain the letters from cp1252 plus any other word character (\\w), so something like this:
String test = "ŠŒŽšœžŸÀÁÂà ÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ Þßàáâãäåæçèéêëìíîïð ñòóôõöøùúûüýþÿ asvsdf QWESA 1234 ÜüËëÄäÖö";
System.out.println(test.matches("[ŠŒŽšœžŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ\\w"
        + "\\d\\s\\.]+"));
I don't need to detect the encoding itself, only if it doesn't belong to the charset.
CodePudding user response:
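You can check whether each character's Unicode code point falls outside the ranges 0x20 to 0x7E and 0xA0 to 0xFF, which cover the printable ASCII characters and the Latin-1 portion of the Windows-1252 character set; the en dash (U+2013) is additionally allowed. Note that Windows-1252 also maps bytes 0x80 to 0x9F to characters such as €, Š and Œ, whose code points lie above 0xFF, so this check rejects them. Something like this: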
String input = "any text goes here";
for (int i = 0; i < input.length(); i++)
{
    char c = input.charAt(i);
    // Reject control characters, the 0x7F-0x9F gap, and anything above 0xFF except the en dash
    if (c < 0x20 || (c > 0x7E && c < 0xA0) || (c > 0xFF && c != '\u2013'))
    {
        throw new IllegalArgumentException("Character at index " + i + " does not belong to the Windows 1252 character set: " + c);
    }
}
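To see which characters a pure range check like the one above leaves out, here is a small sketch that decodes the bytes 0x80 to 0x9F as windows-1252 and collects the resulting characters whose code points are above 0xFF. The class and method names (`Cp1252HighRange`, `charsAboveLatin1`) are made up for illustration, and it assumes the JDK's built-in `windows-1252` charset is available.

```java
import java.nio.charset.Charset;

public class Cp1252HighRange {

    // Decodes each byte in 0x80..0x9F as windows-1252 and returns the
    // decoded characters whose Unicode code points lie above 0xFF.
    static String charsAboveLatin1() {
        Charset cs = Charset.forName("windows-1252");
        StringBuilder sb = new StringBuilder();
        for (int b = 0x80; b <= 0x9F; b++) {
            String decoded = new String(new byte[] {(byte) b}, cs);
            char c = decoded.charAt(0);
            // Skip the replacement character produced for undefined bytes
            if (c > 0xFF && c != '\uFFFD') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (char c : charsAboveLatin1().toCharArray()) {
            System.out.printf("U+%04X %c%n", (int) c, c);
        }
    }
}
```

Any character this prints (the euro sign, Š, Œ, the curly quotes and so on) would need its own exception in the `if` condition, just like the en dash.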
CodePudding user response:
My suggestion as code:
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Windows1252Tester {
    public static void main(String[] args) {
        try {
            // Can we encode the incoming UTF-8 (per OP) as Windows-1252?
            Charset cs = Charset.forName("Windows-1252");
            CharsetEncoder enc = cs.newEncoder();
            System.out.printf("Can charset %s encode sequence %s? %b%n", cs, args[0], enc.canEncode(args[0]));
        }
        catch(Throwable t) {
            t.printStackTrace();
        }
    }
}
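Since the question asks for an error to be thrown, the same `CharsetEncoder.canEncode` idea could be wrapped in a validation helper that also reports where the first offending character is. This is only a sketch: the names `Windows1252Validator`, `firstUnencodableIndex` and `requireWindows1252` are hypothetical, and it checks one `char` at a time, so it accepts control characters such as newlines that the charset can encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Windows1252Validator {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the index of the first char that windows-1252 cannot encode,
    // or -1 if every char can be encoded.
    static int firstUnencodableIndex(String input) {
        // Encoders are not thread-safe, so create one per call
        CharsetEncoder enc = CP1252.newEncoder();
        for (int i = 0; i < input.length(); i++) {
            if (!enc.canEncode(input.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Throws IllegalArgumentException if the input contains a char outside windows-1252.
    static void requireWindows1252(String input) {
        int i = firstUnencodableIndex(input);
        if (i >= 0) {
            throw new IllegalArgumentException(String.format(
                    "Character at index %d (U+%04X) does not belong to Windows-1252",
                    i, (int) input.charAt(i)));
        }
    }

    public static void main(String[] args) {
        // Contains only characters that windows-1252 can encode
        requireWindows1252("Zo\u00EB \u0160koda \u20AC5");
        try {
            // U+0141 (L with stroke) is not part of windows-1252
            requireWindows1252("Krak\u00F3w \u0141\u00F3d\u017A");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the position is not needed, a single `enc.canEncode(input)` call on the whole string, as in the answer above, is enough.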