Decode Korean String from Bytes in Java

I am struggling to convert a byte array to a string of UTF-8 Korean characters in Java. Wikipedia states that 3 bytes are used for each character, but that not all of their bits are taken into account.

Is there a simple way of converting this very special format? I don't want to write loops and counters that keep track of bits and bytes, as that would get messy, and I can't imagine that there is no simple solution. A native Java library would be perfect, or maybe someone has figured out some smart bit-shift logic.

UPDATE:

These bytes

[91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50]

should output this:

[공사] 율곡로

But using

new String(shortStrBytes, "UTF8"); // or
new String(shortStrBytes, StandardCharsets.UTF_8);

turns them to this:

[����] �����
The returned string has 50% more chars

CodePudding user response：

You should use StandardCharsets.UTF_8. Converting from String to byte[] and vice versa:

import java.nio.charset.StandardCharsets;

public class Translater {

    // Decode UTF-8 bytes into a String
    public static String translateBytesToString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Encode a String as UTF-8 bytes
    public static byte[] translateStringToBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        final String STRING = "[공사] 율곡로";
        final byte[] BYTES = {91, -22, -77, -75, -20, -126, -84, 93, 32, -20, -100, -88, -22, -77, -95, -21, -95, -100};

        String s = translateBytesToString(BYTES);
        byte[] b = translateStringToBytes(STRING);

        System.out.println("String: " + s);
        System.out.print("Bytes: ");
        for (int i = 0; i < b.length; i++)
            System.out.print(b[i] + " ");
    }
}
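One quick sanity check, sketched here under the assumption that the expected text is known in advance, is to encode that text as UTF-8 and compare the result with the bytes actually received. The class name Utf8Check and the helper isUtf8EncodingOf are made up for this illustration. The string literal uses Unicode escapes so that the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Check {

    // True if `bytes` is exactly the UTF-8 encoding of `expected`.
    // (Hypothetical helper, not part of any library.)
    static boolean isUtf8EncodingOf(byte[] bytes, String expected) {
        return Arrays.equals(bytes, expected.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "[공사] 율곡로" written with Unicode escapes
        String expected = "[\uACF5\uC0AC] \uC728\uACE1\uB85C";
        // The bytes posted in the question
        byte[] fromQuestion = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};

        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 length:    " + utf8.length);         // 3 ASCII bytes + 5 syllables * 3 bytes
        System.out.println("Question length: " + fromQuestion.length);
        System.out.println("Same bytes?      " + isUtf8EncodingOf(fromQuestion, expected));
    }
}
```

If the lengths or the contents differ, as they do for the 13 bytes in the question, the input is most likely not UTF-8 at all, and no amount of bit shifting on the decoding side will fix it.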

CodePudding user response：

Since you added the bytes to the question, I have done a little research and some experimenting, and I believe that the text you have is encoded as EUC-KR. I got the expected Korean characters when interpreting them as that encoding.

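import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat; // requires Java 17 or later

// convert bytes to a Java String
byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
String str = new String(data, Charset.forName("EUC-KR")); // Charset overload avoids the checked UnsupportedEncodingException

// now convert String to UTF-8 bytes
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
System.out.println(HexFormat.ofDelimiter(" ").formatHex(utf8));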

This prints the following hexadecimal values:

5b ea b3 b5 ec 82 ac 5d 20 ec 9c a8 ea b3 a1 eb a1 9c

This is the correct UTF-8 encoding of those Korean characters. On a terminal that supports them, printing the string should display them correctly, too.
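Because new String(bytes, charset) silently substitutes U+FFFD for anything it cannot decode, a wrong charset guess is easy to miss. The following is a minimal sketch of a stricter approach using CharsetDecoder with CodingErrorAction.REPORT, which fails instead of substituting. The class name StrictDecode and the helper decodeStrict are invented for this example, and it assumes the JDK in use ships the EUC-KR charset (standard full JDK builds do).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class StrictDecode {

    // Decodes `bytes` with `charset`, returning an empty Optional when the
    // bytes are not valid in that charset (instead of inserting U+FFFD).
    // (Hypothetical helper, not part of any library.)
    static Optional<String> decodeStrict(byte[] bytes, Charset charset) {
        try {
            String decoded = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return Optional.of(decoded);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};

        // Not valid UTF-8, so strict decoding reports a failure
        System.out.println("UTF-8:  " + decodeStrict(data, StandardCharsets.UTF_8));
        // Valid EUC-KR, so the Korean text is returned
        System.out.println("EUC-KR: " + decodeStrict(data, Charset.forName("EUC-KR")));
    }
}
```

A successful strict decode does not prove that the charset guess is right, since many byte sequences are valid in more than one encoding, but a failure does rule the candidate out.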