UTF-8 encoding java replacement with 3 hex character

Time:04-30

My application is receiving data from an endpoint. I do some processing with the received data and insert it into the database.

At some point I had a problem with the encoding since the data received is UTF-8 and my database uses LATIN1. The error displayed is:

[org.hibernate.util.JDBCExceptionReporter] ERROR: character with byte sequence 0x2020 in encoding "UTF8" has no equivalent in encoding "LATIN1"

In most cases I can solve it by replacing it like this:

String text = org.apache.commons.lang.StringUtils.replace(s, "\u2020", " ");

But now the error is:

ERROR [org.hibernate.util.JDBCExceptionReporter] ERROR: character with byte sequence 0xe2 0x9c 0xa8 in encoding "UTF8" has no equivalent in encoding "LATIN1"

How can I replace this three-byte sequence? PS: Unfortunately, changing the database encoding is not an option. It's a legacy system.

CodePudding user response:

The UTF-8 byte sequence 0xE2 0x9C 0xA8 is U+2728 (sparkles). Just as you already handle U+2020, you can manually replace it with any Latin1 character (or combination of characters) that resembles sparkles. There is, however, no canonical way to do this. Essentially, you have to check for every character above U+00FF, because Latin1 only covers U+0000 through U+00FF.
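As a rough illustration of that check, the sketch below lists every code point above 0xFF in a string. The class and method names (NonLatin1Finder, findNonLatin1) are made up for this example, and it assumes Java 8 or later for String.codePoints().

```java
import java.util.ArrayList;
import java.util.List;

public class NonLatin1Finder {

    // Returns every code point above 0xFF in the input, formatted as "U+XXXX".
    public static List<String> findNonLatin1(String s) {
        List<String> found = new ArrayList<>();
        // codePoints() joins surrogate pairs, so an emoji is reported once.
        s.codePoints()
         .filter(cp -> cp > 0xFF)
         .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Sample text: an accented letter, sparkles, and the face palm emoji.
        String sample = "ol\u00E1 \u2728 \uD83E\uDD26";
        System.out.println(findNonLatin1(sample)); // [U+2728, U+1F926]
    }
}
```

Running something like this over a sample of the incoming data shows which characters actually occur before you decide how to replace them.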

What is somewhat odd here is that your code handles Strings, and strictly speaking, Java Strings are always UTF-16. However, UTF-8 and UTF-16 are both encodings of the same Unicode character set, so a Java String can represent any text that arrived as UTF-8.

In the scenario you are describing, you would usually receive byte data, use a CharsetDecoder to convert the UTF-8 bytes to a character string, and use a CharsetEncoder to convert the character string to Latin1 bytes. Both classes have the method onUnmappableCharacter(), which specifies the action to take when a character or byte sequence has no representation in the respective encoding. Receiving the input as a String removes any control over how the input bytes are decoded, and forwarding the output as a String removes any control over how the output bytes are encoded.
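A minimal sketch of that decoder/encoder pipeline follows, assuming the raw UTF-8 bytes are available. The class name Utf8ToLatin1, the method name transcode, and the choice of a space as the replacement are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ToLatin1 {

    // Decodes UTF-8 bytes strictly, then encodes to Latin1, writing a space
    // for every character that Latin1 cannot represent.
    public static byte[] transcode(byte[] utf8Bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { ' ' });
        try {
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(utf8Bytes));
            ByteBuffer out = encoder.encode(chars);
            byte[] latin1 = new byte[out.remaining()];
            out.get(latin1);
            return latin1;
        } catch (CharacterCodingException e) {
            // Only reachable when the input is not valid UTF-8.
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        // 'a', sparkles (0xE2 0x9C 0xA8), 'b'
        byte[] input = { 0x61, (byte) 0xE2, (byte) 0x9C, (byte) 0xA8, 0x62 };
        System.out.println(Arrays.toString(transcode(input))); // [97, 32, 98]
    }
}
```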

In this specific case, the conversion from a character string to a Latin1 byte string is probably done by the database driver or database itself, over which there seems to be no control. You could work around this by converting the character string to a Latin1 byte string, handling all unmappable characters and then converting it back to a character string in order to pass it to the database driver.
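The round trip described above could look roughly like this. Latin1Sanitizer and sanitize are hypothetical names, and replacing with a space is only one possible policy.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Latin1Sanitizer {

    // Encodes the string to Latin1 (unmappable characters become a space)
    // and decodes it back, so the result only holds Latin1 characters.
    public static String sanitize(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { ' ' });
        try {
            ByteBuffer latin1 = encoder.encode(CharBuffer.wrap(input));
            return StandardCharsets.ISO_8859_1.decode(latin1).toString();
        } catch (CharacterCodingException e) {
            // Not expected, because both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Teste \u2728 com acentua\u00E7\u00F5es"));
    }
}
```

The returned String can then be passed to the database driver, since every character left in it has a Latin1 equivalent. Accented letters such as ç and õ pass through unchanged.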

CodePudding user response:

Thanks for the answers, they helped me. I realized that I would have to handle each encoding separately. I'll just complement the answer with the code I made.

  1. I did the accent substitutions (to pt-br);
  2. I applied the replacements of specific encodings by blank space;

And as Izruo commented, I needed to work with the UTF-16 code units, because that is what Java's \u escapes refer to. For the example I gave (0xE2 0x9C 0xA8), I replaced "\u2728". For the "face palm" emoji (UTF-8 encoding 0xF0 0x9F 0xA4 0xA6), which is a surrogate pair in UTF-16, I applied two replacements: "\uD83E" and "\uDD26".

import java.nio.charset.StandardCharsets;
import org.apache.commons.lang.StringUtils;

public static void main(String[] args) throws Exception {
    String rawString = "Teste com acentuações";
    byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8);

    String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
    System.out.println(convertToLatin1(utf8EncodedString));
}

public static String convertToLatin1(String utf8String) throws Exception {
    if (utf8String == null || utf8String.trim().length() == 0) {
        return UtilString.EMPTY;
    }
    String newString = replaceAccents(utf8String);
    newString = StringUtils.replace(newString, "\u2728", " ");
    newString = StringUtils.replace(newString, "\u277C", " ");
    newString = StringUtils.replace(newString, "\u277D", " ");
    newString = StringUtils.replace(newString, "\u277E", " ");
    newString = StringUtils.replace(newString, "\u277F", " ");
    newString = StringUtils.replace(newString, "\u2665", " ");
    newString = StringUtils.replace(newString, "\u2705", " ");
    newString = StringUtils.replace(newString, "\ud83d", " ");
    newString = StringUtils.replace(newString, "\ude4b", " ");
    newString = StringUtils.replace(newString, "\ud83c", " ");
    newString = StringUtils.replace(newString, "\udffb", " ");
    newString = StringUtils.replace(newString, "\u2340", " ");
    newString = StringUtils.replace(newString, "\u200d", " ");
    newString = StringUtils.replace(newString, "\u2640", " ");
    newString = StringUtils.replace(newString, "\ufe0f", " ");
    newString = StringUtils.replace(newString, "\uD83E", " ");
    newString = StringUtils.replace(newString, "\uDD70", " ");
    newString = StringUtils.replace(newString, "\uDD20", " ");
    newString = StringUtils.replace(newString, "\uDD26", " ");
    
    return newString;
}

// Replacement for the character at the same index in UNICODE.
// Accented letters used in pt-br exist in Latin1 and are kept as they are.
private static final String PLAIN_ASCII = "AaEeIiOoUu"  // grave
        + "ÁáÉéÍíÓóÚúÝý"  // acute
        + "ÂâÊêÎîÔôÛûYy"  // circumflex
        + "ÃãEeIiÕõUuYy"  // tilde
        + "ÄäËëÏïÖöÜüYy"  // diaeresis
        + "Aa"            // ring
        + "Çç"            // cedilla
        + "Nn";           // N with tilde

private static final String UNICODE = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9" // grave
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD" // acute
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177" // circumflex
        + "\u00C3\u00E3\u1EBC\u1EBD\u0128\u0129\u00D5\u00F5\u0168\u0169\u1EF8\u1EF9" // tilde
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF" // diaeresis
        + "\u00C5\u00E5" // ring
        + "\u00C7\u00E7" // cedilla
        + "\u00D1\u00F1"; // N with tilde

public static String replaceAccents(String aString) {
    StringBuilder stringbuilder = new StringBuilder();
    for (int i = 0; i < aString.length(); i++) {
        char c = aString.charAt(i);
        int position = UNICODE.indexOf(c);
        if (position > -1) {
            stringbuilder.append(PLAIN_ASCII.charAt(position));
        } else {
            stringbuilder.append(c);
        }
    }
    return stringbuilder.toString();
}

So if you are having a similar problem, look up the character's UTF-16 encoding (I found it on compart.com, but fileformat.info works too) and apply the replacement you want.
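The same lookup can also be done in code instead of on a website. The sketch below takes the bytes from the database error message and prints the matching UTF-16 escapes. Utf16Escapes and toJavaEscapes are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Escapes {

    // Decodes UTF-8 bytes and renders each UTF-16 code unit as a
    // Java-style escape (backslash, 'u', four hex digits).
    public static String toJavaEscapes(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (char unit : decoded.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes copied from the error messages: sparkles and face palm.
        byte[] sparkles = { (byte) 0xE2, (byte) 0x9C, (byte) 0xA8 };
        byte[] facePalm = { (byte) 0xF0, (byte) 0x9F, (byte) 0xA4, (byte) 0xA6 };
        System.out.println(toJavaEscapes(sparkles)); // one escape: 2728
        System.out.println(toJavaEscapes(facePalm)); // two escapes: D83E, DD26
    }
}
```

A character outside the Basic Multilingual Plane, such as the face palm emoji, produces two escapes because it is stored as a surrogate pair.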
