How many bytes does a `char` occupy in Java?

Time:11-24

In Java 8, a String is stored as a char[]. So if I write String test = "a"; I think a is one element in that char[]. As we know, a char occupies 2 bytes in Java, so I expected test.getBytes().length to be 2, but it is 1.

String test = "a";
System.out.println(test.getBytes().length);
char c = 'c';
System.out.println(charToByte(c).length);

The result is:

1
2

A letter occupies 1 byte, as we know, but a is stored as one element in the char[], and a char occupies 2 bytes. So I wonder where my misunderstanding is.

CodePudding user response:

From Oracle docs:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

So, long story short, a char is always 2 bytes: it is one 16-bit UTF-16 code unit. That size is fixed by the language, so it does not change from one JVM implementation to another.

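On the other hand, the bytes of a String can be produced in many encodings, including UTF-8 (which uses 1 byte for ASCII letters such as a, and up to 4 bytes for other characters). getBytes() without an argument uses the default charset, so the result depends on the JVM's configuration and not on how the char is held in memory.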
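The two numbers from the question can be reproduced with explicit charsets. This is a minimal sketch; the class and method names are made up for illustration, and it assumes the question's charToByte helper simply wrote out the two bytes of the char.

```java
import java.nio.charset.StandardCharsets;

class CharSizeDemo {
    // Bytes needed when the text is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed when the text is encoded as UTF-16 (big endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(Character.BYTES);  // 2: the size of a char value
        System.out.println(utf8Length("a"));  // 1
        System.out.println(utf16Length("a")); // 2
    }
}
```

The in-memory size of a char and the number of bytes an encoding produces are separate questions, which is why the question saw both 1 and 2.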

CodePudding user response:

The basics on String

String holds text as Unicode, and hence can combine Greek, Arabic and Korean in a single String.

The type char holds 2 bytes: one code unit of the Unicode transformation format UTF-16. Most Unicode code points (characters, symbols) fit in 1 char, but sometimes a pair of chars (a surrogate pair) is needed.
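A short sketch of a code point that needs a pair of chars. The class and method names are hypothetical; U+1D11E (the musical G clef) is just one example of a character outside the 16-bit range.

```java
class SurrogatePairDemo {
    // Number of 16-bit char units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String letter = "a";
        String clef = new String(Character.toChars(0x1D11E)); // one code point above U+FFFF

        System.out.println(charCount(letter) + " " + codePointCount(letter)); // 1 1
        System.out.println(charCount(clef) + " " + codePointCount(clef));     // 2 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));        // true
    }
}
```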

The conversion between text (String) and binary data (byte[])

Binary data is always encoded in some Charset, and there is always a conversion between text and bytes.

// import java.nio.charset.Charset;
String s = "a";
Charset charset = Charset.defaultCharset();
byte[] b = s.getBytes(charset);     // text -> bytes
String t = new String(b, charset);  // bytes -> text
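As a sketch of why the charset matters in both directions, the round trip below uses an explicit charset instead of the default one, and then decodes the same bytes with a different charset. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Text -> bytes, with an explicit charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes -> text, with the same charset.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "ru\u011Da"; // U+011D is g with circumflex
        byte[] data = encode(original);

        // Same charset on both sides: the text survives.
        System.out.println(decode(data).equals(original)); // true

        // Decoding with a different charset yields different text.
        String wrong = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 5: every byte became one char
    }
}
```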

The number of bytes a String occupies

The string "ruĝa" contains 4 code points (symbols, glyphs). It is stored in memory as 4 chars of 2 bytes = 8 bytes (plus a small amount of object overhead).

It can be stored in binary data for some charset:

  • in Latin-1 as "ru�a" or "ru?a" (lossy: the conversion of ĝ fails) = 4 bytes
  • in full UTF-32 as 4x4 = 16 bytes
  • in Latin-3 as "ruĝa" = 4 bytes
  • in UTF-8 as "ruĝa" = 5 bytes (ĝ takes 2 bytes, the other letters 1 byte each)
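The byte counts for this word can be checked directly. This is a sketch with made-up names; it assumes the JDK provides the UTF-32BE charset (common JDKs do, but it is not one of the charsets every Java platform must support). UTF-8 comes out at 5 bytes because only ĝ needs 2 bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String word = "ru\u011Da"; // U+011D is g with circumflex

        System.out.println(encodedLength(word, StandardCharsets.UTF_8));      // 5
        System.out.println(encodedLength(word, StandardCharsets.UTF_16BE));   // 8
        // Assumption: UTF-32BE is available on this JDK.
        System.out.println(encodedLength(word, Charset.forName("UTF-32BE"))); // 16

        // Latin-1 has no g with circumflex, so getBytes substitutes '?'.
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length);                                    // 4
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));  // ru?a
    }
}
```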

However, since Java 9 a String may internally use a byte array instead of a char array, together with a flag for the encoding (Latin-1 or UTF-16), so it can save on memory. That relies on the actual content fitting a single-byte encoding. You should not count on this, say for dynamic strings.

Answer

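// import java.nio.charset.StandardCharsets;
public static int bytesInMemory(String s) {
    // UTF_16LE writes no byte order mark; plain UTF_16 would add 2 bytes for it.
    return s.getBytes(StandardCharsets.UTF_16LE).length;
}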

Most code points (symbols) take 2 bytes; some take 4 bytes each.
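One detail worth checking when counting UTF-16 bytes through getBytes: StandardCharsets.UTF_16 prepends a 2-byte byte order mark when encoding, while UTF_16LE and UTF_16BE do not. A small sketch, with class and method names made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Encoded length including the byte order mark that UTF_16 writes.
    static int withBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    // Encoded length without a byte order mark.
    static int withoutBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String s = "a";
        System.out.println(withBom(s));     // 4: 2 bytes of mark + 2 bytes for 'a'
        System.out.println(withoutBom(s));  // 2
        System.out.println(s.length() * 2); // 2: char count times 2 gives the same figure
    }
}
```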

And note that é might be 2 or 4 bytes: it can be one code point, or two code points (the base letter e plus a zero-width combining accent). Vietnamese can even have two accents per letter, so 3 code points.
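The two forms of é can be produced with java.text.Normalizer. In this sketch (hypothetical class and method names), NFD splits a letter into base letter plus combining accents, and U+1EC7 is used as an example of a Vietnamese letter with two accents.

```java
import java.text.Normalizer;

class AccentDemo {
    // Decomposed form: base letter followed by combining accents.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Size as UTF-16 data: 2 bytes per char.
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String composed = "\u00E9";              // e with acute, one code point
        String decomposed = decompose(composed); // e + U+0301 combining acute

        System.out.println(utf16Bytes(composed));   // 2
        System.out.println(utf16Bytes(decomposed)); // 4

        String vietnamese = decompose("\u1EC7"); // e with circumflex and dot below
        System.out.println(vietnamese.codePointCount(0, vietnamese.length())); // 3
    }
}
```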
