How many bytes does `char` occupy in Java? - CodePudding

I am using Java 8, where a String is stored as a char[]. So if I write String test = "a"; I think "a" is one element of that char[]. As we know, a char occupies 2 bytes in Java, so I expected test.getBytes().length to be 2, but it is 1.

String test = "a";
System.out.println(test.getBytes().length);   // prints 1
char c = 'c';
// charToByte is the asker's own helper (not shown in the question)
System.out.println(charToByte(c).length);     // prints 2

The result is:

1
2

A letter occupies 1 byte, as we know, but "a" is stored as one element of a char[], and a char occupies 2 bytes. So I wonder what I have misunderstood.

CodePudding user response：

From Oracle docs:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

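So, long story short, a char is always 2 bytes: it holds one UTF-16 code unit, and that size is fixed by the Java Language Specification, so it does not change from one JVM implementation to another.

On the other hand, a String can be encoded into bytes with many encodings, including UTF-8 (which needs only 1 byte for an ASCII character such as "a"). getBytes() with no argument uses the platform default charset, so the number of bytes you get depends on that charset and on the contents of the string.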
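To make the difference concrete, here is a minimal sketch that compares the fixed size of a char with the number of bytes produced when the same text is encoded with an explicit charset. The class name `CharVsBytes` and the helper `encodedLength` are illustrative only and do not come from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVsBytes {
    // Number of bytes produced when s is encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // A char is always 16 bits, regardless of JVM or platform.
        System.out.println(Character.BYTES);                               // 2
        // 'a' is ASCII, so UTF-8 needs a single byte for it.
        System.out.println(encodedLength("a", StandardCharsets.UTF_8));    // 1
        // UTF-16BE writes one 2-byte code unit per char, with no byte order mark.
        System.out.println(encodedLength("a", StandardCharsets.UTF_16BE)); // 2
        // With no argument, getBytes() uses the platform default charset,
        // so this value depends on how the JVM is configured.
        System.out.println("a".getBytes().length);
    }
}
```

Passing the charset explicitly, as `encodedLength` does, keeps the result independent of the machine the code runs on.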

CodePudding user response：

The basics on String

String holds text as Unicode, and hence can combine Greek, Arabic and Korean in a single String.

The type char holds 2 bytes: one code unit of the Unicode transfer format UTF-16. Most Unicode code points (characters, symbols) fit in 1 char, but some need a pair of chars (a surrogate pair).
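As a rough illustration of "1 char versus a pair of chars", the sketch below counts chars and code points for two sample strings. The class and method names are hypothetical, and the sample characters (Greek alpha and the musical G clef, U+1D11E) are chosen here only as examples.

```java
class CodeUnits {
    // Number of UTF-16 code units (chars) in s.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // GREEK SMALL LETTER ALPHA fits in a single char.
        String alpha = "\u03b1";
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range,
        // so Character.toChars returns a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(charCount(alpha) + " char(s), " + codePointCount(alpha) + " code point(s)"); // 1 char(s), 1 code point(s)
        System.out.println(charCount(clef) + " char(s), " + codePointCount(clef) + " code point(s)");   // 2 char(s), 1 code point(s)
    }
}
```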

The conversion between text (`String`) and binary data (`byte[]`)

Binary data that represents text is always encoded in some Charset, so going between String and byte[] always involves a conversion.

import java.nio.charset.Charset;
String s = "a";
Charset charset = Charset.defaultCharset();
byte[] b = s.getBytes(charset);            // String -> bytes
String decoded = new String(b, charset);   // bytes -> String

The number of bytes a String occupies

The string "ruĝa" contains 4 code points, symbols, glyphs. It is stored in memory as 4 chars of 2 bytes = 8 bytes (plus a small amount of object overhead).

It can be stored as binary data in some charset:

in Latin-1 as "ru�a" or "ru?a" (lossy, failed conversion of ĝ)
in full UTF-32 as 4x4 = 16 bytes
in Latin-3 as "ruĝa" = 4 bytes
in UTF-8 as "ruĝa" = 5 bytes (ĝ takes 2 bytes, the other letters 1 byte each)
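The sizes in this list can be checked with a short sketch like the following. The class `EncodedSizes` and its `size` helper are made up for illustration. ISO-8859-3 (Latin-3) and UTF-32BE are looked up by name and guarded with `Charset.isSupported`, on the assumption that not every JVM ships them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Number of bytes s occupies when encoded with the given charset.
    static int size(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "ruĝa", with ĝ (U+011D) written as an escape so the source file encoding does not matter.
        String s = "ru\u011da";

        System.out.println("chars:      " + s.length());                           // 4
        System.out.println("UTF-8:      " + size(s, StandardCharsets.UTF_8));      // 5
        System.out.println("UTF-16BE:   " + size(s, StandardCharsets.UTF_16BE));   // 8
        // Latin-1 cannot represent ĝ; getBytes substitutes the charset's replacement byte '?'.
        System.out.println("ISO-8859-1: " + size(s, StandardCharsets.ISO_8859_1)); // 4
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));       // ru?a

        // Optional charsets: only queried when this JVM provides them.
        for (String name : new String[] {"ISO-8859-3", "UTF-32BE"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + size(s, Charset.forName(name)));
            } else {
                System.out.println(name + ": not available on this JVM");
            }
        }
    }
}
```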

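However, since Java 9 (compact strings), a String may internally use a byte array plus an encoding flag instead of a char array, so it can save on memory. That relies on every character of the actual content fitting in a single byte (Latin-1); otherwise 2 bytes per char are used as before. You should not count on this, say for dynamic strings.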

Answer

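import java.nio.charset.StandardCharsets;
// Bytes of character data at 2 bytes per char, excluding object overhead.
// UTF_16BE is used because UTF_16 would prepend a 2-byte byte order mark.
public static int bytesInMemory(String s) {
    return s.getBytes(StandardCharsets.UTF_16BE).length;
}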
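One detail worth checking when measuring with getBytes: Java's UTF_16 charset prepends a 2-byte byte order mark when encoding, whereas UTF_16BE and UTF_16LE do not. The sketch below shows the difference for the asker's string "a". The class `Utf16Sizes` and the helper `utf16DataBytes` are hypothetical names, and the helper simply multiplies the char count by 2 as an alternative way to get the same figure.

```java
import java.nio.charset.StandardCharsets;

class Utf16Sizes {
    // Bytes of UTF-16 character data: two per char, no byte order mark.
    static int utf16DataBytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String s = "a";
        // UTF_16 writes a byte order mark (FE FF) before the data.
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 4
        // UTF_16BE writes only the data bytes.
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(utf16DataBytes(s));                            // 2
    }
}
```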

Most code points (symbols) take 2 bytes; some take 4 bytes each.

And note that é might be 2 or 4 bytes: one code point or two code points (basic letter e and a zero-width combining accent). Vietnamese can even have two accents per letter, so 3 code points.
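The two spellings of é can be compared with `java.text.Normalizer`, as in this sketch. The class `Accents` and its helpers are illustrative names, and the Vietnamese letter ệ (U+1EC7) is picked here as one example of a letter with two accents.

```java
import java.text.Normalizer;

class Accents {
    // Canonical composition: combine base letters and accents where possible.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition: split letters into base letter plus combining accents.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";     // é as a single code point
        String decomposed = "e\u0301";  // e followed by COMBINING ACUTE ACCENT

        System.out.println(composed.length());                        // 1 (1 char = 2 bytes)
        System.out.println(decomposed.length());                      // 2 (2 chars = 4 bytes)
        // The two look the same on screen but are different strings.
        System.out.println(composed.equals(decomposed));              // false
        System.out.println(composed.equals(compose(decomposed)));     // true

        // Vietnamese ệ decomposes into e plus two combining accents.
        System.out.println(decompose("\u1ec7").length());             // 3
    }
}
```

Normalizing to one form before comparing or measuring strings avoids surprises when the same visible text arrives in different spellings.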
