I am trying to convert codepoints from one charset to another in Java.
For example, the character ř is 248 in windows-1250 and 345 in Unicode.
So I have a source charset, a source code point, and a target charset, and I want to calculate the target code point.
This may sound easy because windows-1250 is a single-byte charset, but I want it to work with any charset, such as GB2312.
I guess it can be done somehow with the Charset class, but it seems to convert only bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 Chinese character
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked the Charset class for code point related methods, but it only offers decode and encode, which work with bytes. I tried searching for something related, without success.
Thanks in advance for any help.
CodePudding user response:
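In Java, at least, there is no notion of code points for character sets other than Unicode; for any other charset you work with bytes. You have to convert the integer to a byte array and then decode that array with the source charset to get the Unicode code point.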
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
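int sourceCodePoint = 45257; // 吧 Chinese character
// Requires java.nio.ByteBuffer; the default byte order is big-endian, so the high byte comes first.
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();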
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543
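The two snippets above hard-code the byte length (1 for windows-1250, 2 for GB2312). A rough sketch of how both cases might be folded into one helper follows. It assumes the source "code point" is simply the character's encoded bytes read as a big-endian number, as in the examples above, and that it fits in an int. The class and method names (CharsetCodes, toBytes, toUnicodeCodePoint, fromUnicodeCodePoint) are made up for illustration and are not part of the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCodes {

    // Hypothetical helper: pack the numeric code into big-endian bytes,
    // dropping leading zero bytes (248 -> 1 byte, 45257 -> 2 bytes).
    static byte[] toBytes(int code) {
        int length = 1;
        while (length < 4 && (code >>> (8 * length)) != 0) {
            length++;
        }
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = (byte) (code >>> (8 * (length - 1 - i)));
        }
        return bytes;
    }

    // Decode the bytes with the source charset and return the first Unicode code point.
    static int toUnicodeCodePoint(int code, Charset source) {
        String decoded = new String(toBytes(code), source);
        return decoded.codePointAt(0);
    }

    // Reverse direction: encode a Unicode code point in the target charset and
    // read the resulting bytes back as one big-endian number.
    static long fromUnicodeCodePoint(int codePoint, Charset target) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(target);
        long code = 0;
        for (byte b : bytes) {
            code = (code << 8) | (b & 0xFF);
        }
        return code;
    }

    public static void main(String[] args) {
        int unicode = toUnicodeCodePoint(248, Charset.forName("windows-1250"));
        System.out.println(unicode); // 345 (ř)

        // The UTF-8 bytes of ř, read as a single number.
        System.out.println(fromUnicodeCodePoint(unicode, StandardCharsets.UTF_8));
    }
}
```

Note that if the target is Unicode, the decoded code point is already the answer; the reverse step only matters when the target is another byte-oriented charset. Characters that the charset cannot map are silently replaced by new String and getBytes, so stricter code would probably use CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT instead.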