Character Encoding in C internally?

Time:10-10

If I create a string literal with the u8 prefix, does the machine code know, and state, that the value of that variable should be encoded in UTF-8?

So that, no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this"?

Because if I encode something as a normal char string and something else in UTF-8 (e.g. with u8), what is the difference, and how does the computer know the encoding if the machine code doesn't say anything about it?

CodePudding user response:

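u8"..." strings are always encoded in UTF-8, as specified in [lex.string]/1 of the C++ standard; C11 makes the same guarantee for u8 literals in its section on string literals (6.4.5).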

The encoding of "..." strings depends on the compiler (and on the source file encoding), but it shouldn't be hard to configure your IDE to save files in UTF-8, and your compiler to not touch UTF-8 in plain string literals.

In any case, the encoding is handled entirely at compile-time. In the compiled code the strings are just sequences of bytes; there is no conversion between encodings at runtime, unless you explicitly call some function that does that.
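To see that the compiled program contains nothing but bytes, you can dump them. This is a minimal sketch, assuming a C11 (or later) compiler; bytes_to_hex is a hypothetical helper name, not a library function. The literal is written with the universal character name \u00e9 so that the result does not depend on how the source file itself is saved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of a NUL-terminated string into
   `out` as space-separated uppercase hex, e.g. "C3 A9".
   Stops early if `out` is too small for the next byte. */
static void bytes_to_hex(const void *str, char *out, size_t out_size) {
    const unsigned char *p = str;
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    /* Each step needs at most 4 bytes: a space, two hex digits, the NUL. */
    for (; *p != '\0' && used + 4 <= out_size; ++p) {
        used += (size_t)snprintf(out + used, out_size - used,
                                 "%s%02X", used ? " " : "", *p);
    }
}
```

Calling bytes_to_hex(u8"\u00e9", buf, sizeof buf) with a 32-byte buf fills buf with "C3 A9", the UTF-8 encoding of U+00E9, on any conforming compiler. The same call with a plain "\u00e9" literal may give different bytes, because there the compiler uses its execution character set.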

CodePudding user response:

If I create a string literal with the u8 prefix, does the machine code know, and state, that the value of that variable should be encoded in UTF-8?

The machine code knows nothing about it. The compiler encodes the literal as UTF-8 and emits the corresponding sequence of bytes.

So that, no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this"?

At runtime the program simply sends that sequence of bytes to the output device, and the device interprets it correctly only if it expects that encoding. For example, a console that accepts UTF-8 will show the correct characters; one that expects a different encoding will show garbage.
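The "garbage" case can be simulated without a real console. The sketch below is only an illustration: latin1_view_as_utf8 is a hypothetical helper that treats every input byte as one ISO-8859-1 character, which is how a Latin-1 device would read the bytes, and re-encodes the result as UTF-8 so it can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: interpret each of the n input bytes as a single
   ISO-8859-1 character (code point == byte value) and write that
   character to `out` as UTF-8. Returns the number of bytes written. */
static size_t latin1_view_as_utf8(const void *in, size_t n,
                                  unsigned char *out, size_t out_size) {
    const unsigned char *p = in;
    size_t w = 0;

    for (size_t i = 0; i < n; ++i) {
        unsigned char b = p[i];
        if (b < 0x80) {
            /* ASCII range: one byte in UTF-8 as well. */
            assert(w + 1 <= out_size);
            out[w++] = b;
        } else {
            /* U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx. */
            assert(w + 2 <= out_size);
            out[w++] = (unsigned char)(0xC0 | (b >> 6));
            out[w++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return w;
}
```

Feeding it the two bytes C3 A9 (UTF-8 for "é") produces the four bytes C3 83 C2 A9, which is the text "Ã©". The program sent exactly the same bytes in both cases; only the receiver's assumption about the encoding changed.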

CodePudding user response:

Yes, the literal will be encoded in UTF-8, but note that the standard doesn't require char8_t to be exactly 8 bits wide, only that it is capable of storing UTF-8 code units. Some unusual implementation could use 16-bit chars, with only 8 bits of each element actually used.

Also note that a single char8_t can hold only an ASCII character. Every other character needs multiple code units, so it has to be stored in a char8_t string/array even if it is only one character.
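The difference between code units and characters is easy to measure. This is a minimal sketch, assuming valid UTF-8 input and a C11 (or later) compiler; both helper names are hypothetical. The parameters are const void * so the code accepts u8 literals whether the compiler gives them type char (C11) or char8_t (C23).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8. */
static size_t utf8_code_points(const void *str) {
    const unsigned char *p = str;
    size_t count = 0;

    assert(p != NULL);
    for (; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count; /* lead byte or ASCII byte starts a new code point */
        }
    }
    return count;
}

/* Hypothetical helper: number of code units (bytes) before the terminator. */
static size_t utf8_code_units(const void *str) {
    return strlen((const char *)str);
}
```

For u8"A" both functions return 1. For u8"\u20ac" (the euro sign) utf8_code_units returns 3 while utf8_code_points returns 1, which is why that single character cannot fit in one char8_t.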
