Byte arrays and strings in Java


        byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        String s = new String(arr);
        Arrays.equals(arr, s.getBytes());  // returns false

Why are the arrays not equal? I would expect getBytes() to return the original byte array.

CodePudding user response:

It depends on your Charset.defaultCharset(), which determines how the bytes are interpreted. Most likely some of the negative bytes form sequences that are not valid (or not canonical) in that charset, so decoding does not map them back to the same bytes.

(see this great answer: https://stackoverflow.com/a/7934397/461499)

Re-decoding the result of getBytes() into a String then gives you the canonical form, and comparing against that round trip returns true:

    System.out.println(Charset.defaultCharset()); //UTF-8 here :)

    byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
    String s = new String(arr);
    System.out.println(s);

    System.out.println(Arrays.toString(s.getBytes()));
    // [56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67]
    System.out.println(Arrays.equals(arr, s.getBytes()));  // returns false

    byte arr2[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67};
    String s2 = new String(arr2);
    System.out.println(Arrays.toString(s2.getBytes()));
    System.out.println(Arrays.equals(arr2, s2.getBytes()));  // returns true

CodePudding user response:

You seem to think that bytes and characters are interchangeable.

They simply are not.

To turn characters into bytes, you 'encode' the characters using a 'charset encoding'. To turn bytes back into characters, you decode them using a 'charset encoding'. There is no such thing as converting one to the other without a charset encoding.

The transition bytes->chars->bytes is only 'perfect' (guaranteed to always give you the same byte array back) for a select few encoding systems. Most encoding systems do not have this property. One encoding that does is ISO-8859-1. However, the 2 most common encodings do not have this property: neither UTF-8 nor US-ASCII gets the job done.
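For example, a minimal sketch using explicit charsets (and the byte array from the question) shows that ISO-8859-1 round-trips any byte array, while UTF-8 does not:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};

// ISO-8859-1 maps every byte value 0-255 to exactly one character,
// so decoding and re-encoding gives the original bytes back.
String latin = new String(arr, StandardCharsets.ISO_8859_1);
System.out.println(Arrays.equals(arr, latin.getBytes(StandardCharsets.ISO_8859_1))); // true

// UTF-8 rejects some byte sequences and substitutes the replacement character,
// so the round trip loses information.
String utf8 = new String(arr, StandardCharsets.UTF_8);
System.out.println(Arrays.equals(arr, utf8.getBytes(StandardCharsets.UTF_8))); // false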

The methods you use here (both str.getBytes() as well as new String(byteArr)) use the 'platform default encoding'. Starting with JDK 18, that's guaranteed to be UTF-8 (thus guaranteeing that this will not work properly), and before that, it's whatever your system's default encoding is, which we don't know.

US-ASCII doesn't work because US-ASCII only defines a subset of all byte values as 'valid': 0-127. Most of your bytes (all of them with a minus sign) aren't valid ASCII.

UTF-8 doesn't work because not all byte sequences are valid UTF-8. In other words, there are sequences of bytes that simply cannot be produced by encoding anything with UTF-8.

More to the point though, the entire principle is just broken. Even if you know it's ISO-8859-1, what are you trying to accomplish by doing this? You may be able to translate an arbitrary byte array into ISO-8859-1 and back again without losing anything, but what purpose does this serve? You can easily produce strings that cause havoc, with NUL characters, tabs, backspaces, 'bell' sounds, and other bizarreness. It's a string you'd never ever want to print. Which raises the question: why do you want one, then?

There really is only one sensible answer to that question, and that is: I wish to transport these bytes through a medium that only supports strings. For example, I have some raw bytes, and I want to put them in an email, or in a form field for a jira ticket or something silly like that, and an attachment is for some reason not an option in this. Or I want to stuff it into a URL (https://www.foo.bar/?q=raw-bytes-here).

There are 2 answers to doing that, and neither of them involves new String(byteArr):

Nibbles

Any raw byte can trivially be turned into a hexadecimal representation: 255 (or -1 in signed byte form; it's the same thing) turns into FF, 1 turns into 01 - every byte is always exactly 2 characters long. You can use:

byte f = -1;
String nibbled = String.format("%02X", f & 0xFF); // mask with 0xFF to treat the byte as unsigned
System.out.println(nibbled); // prints 'FF'

Each individual letter/digit (0-9A-F; technically they're all just digits in hexadecimal, where A-F are digits too) represents a 'nibble', because it's half a byte, see. (Boy, the 60s, when these terms were invented, were a hoot, weren't they.)

This is somewhat inefficient; a byte array of X bytes turns into a string of 2*X characters (and each character may well take 2 bytes, e.g. if it is UTF-16 encoded, for a total of 25% efficiency, ouch). But it is trivially readable and common. It's great for short (sub 500 or so bytes) byte arrays.

Another advantage is that you can eyeball the string and know what the data is, if you can read hexadecimal and, if signed is relevant, 2's complement, which is not too difficult.
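For whole arrays you don't have to write the loop yourself if you are on JDK 17 or newer: java.util.HexFormat does the nibble encoding and decoding for you. A quick sketch, again using the bytes from the question:

import java.util.Arrays;
import java.util.HexFormat; // JDK 17+

byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};

String hex = HexFormat.of().withUpperCase().formatHex(arr); // 32 characters for 16 bytes
byte[] back = HexFormat.of().parseHex(hex);

System.out.println(hex);
System.out.println(Arrays.equals(arr, back)); // true

On older JDKs you can build the same string with String.format("%02X", b & 0xFF) in a loop.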

Base64

Base64 is a simple encoding scheme that defines 64 'safe' characters that you know will safely 'survive' without getting mangled or misinterpreted. That gives you 6 bits of data per character. Bytes are 8, so, you can 'stuff' 3 bytes into 4 characters this way; for example a 900 byte array turns into 1200 characters.

Java has base64 encoding/decoding built in.

byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
String s = Base64.getEncoder().encodeToString(arr);
// s is all ASCII chars and safe to include just about everywhere.
// URL parameter, emails, web forms, you name it.
byte[] arr2 = Base64.getDecoder().decode(s);
Arrays.equals(arr, arr2); // true, guaranteed.

Base64 is slightly more complicated, and you can no longer eyeball a base64 string and just see the bytes matrix-style. But it is more efficient than nibble form: 75% efficiency (or 37.5% if the underlying characters take 2 bytes per char, i.e. with UTF-16).
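One small note if the 'medium' is a URL, as in the example above: the standard Base64 alphabet contains + and /, which have special meaning in URLs, so for that case you would typically use the URL-safe variant. A short sketch:

import java.util.Arrays;
import java.util.Base64;

byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
// URL-safe alphabet uses - and _ instead of + and /; padding is dropped so no '=' ends up in the URL
String urlSafe = Base64.getUrlEncoder().withoutPadding().encodeToString(arr);
byte[] back = Base64.getUrlDecoder().decode(urlSafe);
System.out.println(Arrays.equals(arr, back)); // true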

CodePudding user response:

The following constructor reads the byte array and decodes it according to the default charset.

new String(arr);

So when you do

String s = new String(arr);
s.getBytes();

getBytes() gives you back the bytes of the string as it was decoded, re-encoded according to the default charset.

If you inspect this with a debugger you can see how the new String(byte[]) constructor works when the default charset is UTF-8. You will see that the byte {-127} is decoded into {-17, -65, -67}, because -127 is not a valid byte on its own in UTF-8, and {-17, -65, -67} is the UTF-8 representation of the replacement character �.

In fact, any byte or sequence of bytes in the array that can't be decoded as a valid UTF-8 character (when that is the default charset) is converted into {-17, -65, -67}, the representation of �.

In your example the bytes {-9, -127, -23} are not valid for the UTF-8 charset, so those 3 elements are decoded into ���, which as a byte array is again represented by {-17, -65, -67, -17, -65, -67, -17, -65, -67}.
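You can see this with a single byte; a small sketch (the charset is passed explicitly so it does not depend on the platform default):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] bad = {-127}; // 0x81 is never valid on its own in UTF-8

String s = new String(bad, StandardCharsets.UTF_8);
System.out.println((int) s.charAt(0));                // 65533, i.e. U+FFFD, the replacement character �
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8))); // [-17, -65, -67]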

So, removing the non-valid UTF-8 bytes -9, -127, -23 from your example makes it return true for the default charset UTF-8, as all of the remaining bytes can be decoded by UTF-8:

        byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -52, -85, -55, -115, 11};
        String s = new String(arr);
        System.out.println(Arrays.equals(arr, s.getBytes())); //prints true

This shows that when you create a String from a byte array, the original bytes are decoded into characters according to the charset. So you can't expect string.getBytes() to give back the original byte array if some of the provided bytes are not valid according to that charset.

So in the end we can sum it up as follows:

Your code will return true as long as every byte (and sequence of bytes) in your array can be decoded by the charset that the JVM uses when it executes your code.
