I am using Java 8, where a String is stored as a char[]. So if I write the following:
String test = "a";
I think "a" is one element in the char[]. As we know, a char occupies 2 bytes in Java, so I expected test.getBytes().length to be 2, but it is 1.
String test = "a";
System.out.println(test.getBytes().length);
char c = 'c';
System.out.println(charToByte(c).length);
The result is:
1
2
A letter occupies 1 byte, as we know, but "a" is saved as one element in the char[], and a char occupies 2 bytes.
So I wonder: where did I misunderstand?
CodePudding user response:
From Oracle docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
So, long story short, a char is always 2 bytes: it is a single UTF-16 code unit, and that size is fixed by the Java language, not by the JVM implementation.
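On the other hand, getBytes() converts the String to bytes in a particular encoding (the platform default charset when none is given), for example UTF-8, where "a" takes 1 byte. So the result depends on the charset used, not on how the String is stored in memory.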
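To make the difference concrete, here is a minimal sketch that encodes the same one-letter String with several charsets. The class name GetBytesDemo and the helper encodedLength are made up for this illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Hypothetical helper: number of bytes produced when s is encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String test = "a";
        System.out.println(test.length());                                    // 1 (one char)
        System.out.println(encodedLength(test, StandardCharsets.UTF_8));      // 1
        System.out.println(encodedLength(test, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(test, StandardCharsets.UTF_16BE));   // 2
        // UTF_16 (without BE/LE) prepends a 2-byte byte-order mark when encoding.
        System.out.println(encodedLength(test, StandardCharsets.UTF_16));     // 4
    }
}
```

The no-argument getBytes() behaves like one of these calls, with the charset chosen by the platform default, which is why it can return 1 for "a".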
CodePudding user response:
The basics on String
String holds text as Unicode, and hence can combine Greek, Arabic and Korean in a single String.
The type char holds 2 bytes, in the Unicode transfer format UTF-16. Many characters, symbols, Unicode code points will fit in 1 char, but sometimes a pair of chars is needed.
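As a small illustration of the "pair of chars" case, the sketch below compares the number of chars with the number of code points. The class and method names are invented for the example; the code point U+1F600 is just one arbitrary character outside the 16-bit range.

```java
class SurrogatePairDemo {
    // Number of 16-bit char values (UTF-16 code units) in the String.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the String.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String omega = "\u03A9";       // Greek capital omega, fits in one char
        String emoji = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair

        System.out.println(charCount(omega) + " char(s), " + codePointCount(omega) + " code point(s)");
        System.out.println(charCount(emoji) + " char(s), " + codePointCount(emoji) + " code point(s)");
    }
}
```

The first line reports 1 char and 1 code point, the second 2 chars and 1 code point.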
The conversion between text (String) and binary data (byte[])
The binary data is always encoded in some Charset, and there is always a conversion between the two.
Charset charset = Charset.defaultCharset();
byte[] b = s.getBytes(charset);      // encode: text -> bytes
String s2 = new String(b, charset);  // decode: bytes -> text
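A round trip through a charset can be sketched as follows. The helper roundTrip is hypothetical; it simply chains the two JDK calls shown above. It also shows that the conversion is lossy when the charset cannot represent a character: String.getBytes(Charset) substitutes the charset's replacement byte, which is '?' for ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: encode s to bytes and decode it again with the same charset.
    static String roundTrip(String s, Charset charset) {
        byte[] bytes = s.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String s = "ru\u011Da"; // r, u, g-circumflex (U+011D), a

        // UTF-8 can represent every code point, so the text survives unchanged.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));      // true

        // ISO-8859-1 has no g-circumflex, so that character is replaced by '?'.
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1));           // ru?a
    }
}
```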
The number of bytes a String occupies
The string "ruĝa" contains 4 code points, symbols, glyphs.
It is stored in memory as 4 chars of 2 bytes = 8 bytes (plus a small, implementation-dependent object overhead).
It can be stored in binary data for some charset:
- in Latin-1 as "ru�a" or "ru?a" (lossy, failed conversion) = 4 bytes
- in full UTF-32 as 4x4 = 16 bytes
- in Latin-3 as "ruĝa" = 4 bytes
- in UTF-8 as "ruĝa" = 5 bytes (1 byte each for r, u, a and 2 bytes for ĝ)
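The byte counts for this string can be checked with a short sketch like the one below. The class and the helper byteCount are invented for the example. The charset names are looked up at run time, and ISO-8859-3 (Latin-3) in particular is an optional charset that may be missing on some JVMs, so the loop guards with Charset.isSupported.

```java
import java.nio.charset.Charset;

class ByteCountDemo {
    // Hypothetical helper: encodes s in the named charset and returns the number of bytes produced.
    static int byteCount(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String s = "ru\u011Da"; // r, u, g-circumflex (U+011D), a
        String[] names = {"ISO-8859-1", "ISO-8859-3", "UTF-8", "UTF-16BE", "UTF-32BE"};
        for (String name : names) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + byteCount(s, name) + " bytes");
            } else {
                System.out.println(name + ": not available on this JVM");
            }
        }
    }
}
```

The BE variants are used for UTF-16 and UTF-32 so that no byte-order mark is counted along with the text.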
However, since Java 9, String may internally use a byte array instead of a char array, together with an encoding flag, so it can save on memory. That relies on the actual content fitting a single-byte encoding (Latin-1). You should not count on this, say for dynamic strings.
Answer
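import java.nio.charset.StandardCharsets;

public static int bytesInMemory(String s) {
    // UTF_16BE, not UTF_16: plain UTF_16 would prepend a 2-byte byte-order mark
    return s.getBytes(StandardCharsets.UTF_16BE).length;
}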
Most code points (symbols) take 2 bytes; some take 4 bytes each.
And note that é might be 2 or 4 bytes: one code point or two code points (the basic letter e and a zero-width combining accent). Vietnamese can even have two accents per letter, so 3 code points.
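The two spellings of é can be compared with java.text.Normalizer, as in this sketch. The class and the helper utf16Bytes are made up for the illustration; the Vietnamese letter U+1EC7 (e with circumflex and dot below) is just one example of a letter with two accents.

```java
import java.text.Normalizer;

class CombiningAccentDemo {
    // Hypothetical helper: bytes needed for the text as UTF-16 code units (2 bytes per char).
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e-acute as a single code point
        String decomposed = "e\u0301"; // letter e followed by combining acute accent

        System.out.println(utf16Bytes(composed));   // 2
        System.out.println(utf16Bytes(decomposed)); // 4

        // The two forms are not equal as Strings, but normalizing to NFC makes them equal.
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true

        // Fully decomposed (NFD), the Vietnamese letter becomes e plus two combining accents.
        String vietnamese = Normalizer.normalize("\u1EC7", Normalizer.Form.NFD);
        System.out.println(vietnamese.codePointCount(0, vietnamese.length())); // 3
    }
}
```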