There is a table in SQL Server with the collation SQL_Latin1_General_CP1_CS_AS. The table has a varchar(35) column with the same collation SQL_Latin1_General_CP1_CS_AS.
The column contains a string with the character 8f (hexadecimal).
See https://www.fileformat.info/info/unicode/char/008f/index.htm. According to that page, this character encoded in UTF-8 should become c28f.
When I read the value from this column in Java and convert it to UTF-8, the 8f is replaced with efbfbd. So the 8f gets lost... kind of. See https://www.fileformat.info/info/unicode/char/0fffd/index.htm
public static String convertStrToHex(String str) {
    // Encode the string as UTF-8 bytes.
    byte[] getBytesFromString = str.getBytes(StandardCharsets.UTF_8);
    // Interpret the bytes as a positive number and format it as hex.
    BigInteger bigInteger = new BigInteger(1, getBytesFromString);
    String convertedResult = String.format("%X", bigInteger);
    return convertedResult;
}
When I query the table
select BadCol from MyTbl
System.out.println(convertStrToHex(resultSet.getString(1)));
I get EFBFBD and not C28F.
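To narrow down where the byte is lost, the raw bytes could be dumped as well (a sketch; it assumes the driver allows getBytes on a varchar column, which may not hold for every driver):
// Read the same column as raw bytes instead of a String to see
// what the driver actually delivers before any character conversion.
byte[] raw = resultSet.getBytes(1);
StringBuilder hex = new StringBuilder();
for (byte b : raw) {
    hex.append(String.format("%02X", b & 0xFF));
}
System.out.println(hex); // if 8F still shows up here, only getString() loses it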
When I declare a string variable "\u008f" and convert it to UTF-8:
String code="\u008f";
System.out.println(convertStrToHex(code));
I correctly get C28F.
So why does a plain string variable get converted correctly, but the value read over JDBC/ResultSet wrongly?
Tested with SQL Server 2017 and 2019, and with both the mssql-jdbc and jTDS drivers, with the same result.
I would appreciate any help! As I understand it, the JDBC driver is to blame. But why?
CodePudding user response:
The byte with the hexadecimal value 8f is not assigned to any character in Windows-1252, the code page behind the CP1 in your collation name. It's an invalid character there.
Thus, when the value is decoded, it is replaced with the replacement character �.
The replacement character has the Unicode code point U+FFFD. Encoded in UTF-8, it becomes EF BF BD.
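You can confirm both byte sequences with the convertStrToHex method from the question (a minimal sketch):
// U+FFFD (the replacement character) encodes to EF BF BD in UTF-8,
// while U+008F would encode to C2 8F.
System.out.println(convertStrToHex("\uFFFD")); // prints EFBFBD
System.out.println(convertStrToHex("\u008F")); // prints C28F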
CodePudding user response:
You are correct that 8f on its own is not a valid UTF-8 byte sequence. 8f is not a printable Latin-1 character either.
8f is, however, a valid character in some Windows code pages, which extend the corresponding ISO 8859-n charsets in the 80–9F range. Your varchar value is probably a Windows-1250, Windows-1251, Windows-1256, or Windows-1257 value. You will have to make an assumption based on the language of your users or, less ideally, the default language of your software.
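To see which assumption fits your data, you can check what the byte 8f stands for in each candidate code page (a small sketch; it assumes those charsets are available in your JRE, which they normally are):
// needs: import java.nio.charset.Charset;
// Decode the single byte 0x8F under each candidate Windows code page
// and print the resulting character together with its Unicode code point.
byte[] raw = { (byte) 0x8F };
for (String name : new String[] { "windows-1250", "windows-1251", "windows-1256", "windows-1257" }) {
    String decoded = new String(raw, Charset.forName(name));
    System.out.printf("%s -> %s (U+%04X)%n", name, decoded, (int) decoded.charAt(0));
}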
If possible, set your JDBC connection to use one of those charsets. (Exactly how that is done depends on which database and driver you are using. For instance, I believe MySQL allows characterEncoding=windows-1250 as a query parameter in the JDBC URL.)
If you can’t do that, do the conversion yourself when reading the value from the database. Replace this:
resultSet.getString(1)
with one of these (a fuller sketch follows at the end of this answer):
new String(resultSet.getBytes(1), "windows-1250")
new String(resultSet.getBytes(1), "windows-1251")
new String(resultSet.getBytes(1), "windows-1256")
new String(resultSet.getBytes(1), "windows-1257")
Windows-1250 is for central and eastern Europe. Wikipedia says it can be used for Polish, Czech, Slovak, Hungarian, Slovene, Serbo-Croatian, Romanian, Albanian, and German text.
Windows-1251 is for Cyrillic languages. Wikipedia says it can be used for Russian, Ukrainian, and Belarusian, among others.
Windows-1256 is for Arabic languages.
Windows-1257 is for Estonian, Latvian, and Lithuanian.
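Putting it together, here is a minimal sketch of the manual conversion (windows-1250 is only an example; the table and column names come from the question, connection is assumed to be an open java.sql.Connection, and Charset.forName is used instead of a charset name string so no checked UnsupportedEncodingException has to be handled):
// needs: import java.nio.charset.Charset; import java.sql.*;
// (run inside a method that declares throws SQLException)
Charset cp1250 = Charset.forName("windows-1250"); // pick the code page that matches your data
try (Statement st = connection.createStatement();
     ResultSet rs = st.executeQuery("select BadCol from MyTbl")) {
    while (rs.next()) {
        byte[] raw = rs.getBytes(1);               // raw column bytes, no driver-side conversion
        String value = new String(raw, cp1250);    // decode them with the code page you chose
        // Should print the real character's UTF-8 hex instead of EFBFBD,
        // provided the code page guess is right.
        System.out.println(convertStrToHex(value));
    }
}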