Can the "null character" be represented as a multibyte value in C language?

Time:06-03

The ANSI X3.159-1989 "Programming Language C" standard states in the chapter "5.2.1.2 - Multibyte characters" that:

For both [source and execution] character sets the following shall hold:

  • A byte with all bits zero shall be interpreted as a null character independent of shift state.
  • A byte with all bits zero shall not occur in the second or subsequent bytes of a multibyte character.

Does this mean that the following statements are true for the translation and execution environments?

  1. Both the source and execution character sets might have a multibyte value that represents the null character, one for each shift state. [Thoughts: if the translation or execution environment can switch between shift states (which can change the number of bytes used to represent a character), then it should somehow detect the null character not only as the one-byte "null character" from the basic character set, but also as, for example, a two-byte "null character" in a particular shift state.] P.S. This might be a misconception about how the translation and execution environments interpret character values in string literals and the like.
  2. Those characters can be represented only as values with the first byte set to "0" [i.e. first byte with all bits zero], so there is a wide range of ways to represent them: "FFFF 0000", "ABCD 0000", etc.
  3. The "null character" is defined only in the basic execution character set. Both rules in the quote above apply to both the extended translation and execution character sets. So a multibyte representation of the "null character" can exist in both the translation and execution environments, and it's possible to use the multibyte "null character" in source code without escape sequences, by writing that character directly in some kind of literal.

Or can the "null character" only be represented as a single-byte value, and is it the one and only such character, defined by the basic execution character set?

Answer:

Does this mean that the following statements are true for the translation and execution environments?

Both the source and execution character sets might have a multibyte value that represents the null character, one for each shift state.

No. "null character" is a defined term:

A byte with all bits set to 0, called the null character, shall exist in the basic execution character set [...]

In the current standard (C17) that's in paragraph 5.2.1/2, but identical text goes all the way back to C89.

The point of the provisions quoted in the question is that C implementations don't have to care about shift state or extended characters to recognize null characters, and that using a null character as a string terminator does not cause truncation of any multibyte character.

Those characters can be represented only as values with the first byte set to "0" [i.e. first byte with all bits zero], so there is a wide range of ways to represent them: "FFFF 0000", "ABCD 0000", etc.

No. Again, for the purposes of the language spec, "null character" is a defined term meaning a byte with value 0. The point of the provisions under discussion is that implementations don't need to consider any broader context when attempting to identify a null character. For example, string functions such as strcpy() and strlen() don't need to know or care anything about character encoding, shift state, or multibyte characters. They just recognize the end of the string by a null character.
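As a rough illustration of that point, the sketch below scans for the terminator byte by byte, the way strlen() is specified to behave, and never consults the locale or a shift state. The helper name count_until_null and the sample array utf8_sample are hypothetical, and UTF-8 is only assumed here as an example of a multibyte encoding (it has no shift states).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts bytes up to, but not including, the first
 * byte with all bits zero. It knows nothing about encodings, shift
 * states, or multibyte characters. */
size_t count_until_null(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* "hello" with an e-acute, assumed to be UTF-8 encoded for illustration:
 * the accented letter is the two-byte sequence 0xC3 0xA9. Neither byte
 * is zero, so the only zero byte is the terminator added by the
 * compiler. The literal is split so that the hex escapes cannot absorb
 * the characters that follow them. */
const char utf8_sample[] = "h" "\xC3\xA9" "llo";
```

Called on utf8_sample, both count_until_null() and strlen() report 6 bytes (five characters, one of them two bytes long), and the array itself occupies 7 bytes including the terminating null character.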

The "null character" is defined only in the basic execution character set.

The C specification does not require the source character set to have a null character, but the text you quoted says that if it includes a single-byte character with value 0, then that character is a null character for C's purposes.

Both rules in the quote above apply to both the extended translation and execution character sets.

Yes.

So a multibyte representation of the "null character" can exist in both the translation and execution environments, [...]

No. Again, a null character is a byte with value 0, regardless of character set or encoding.

Or can the "null character" only be represented as a single-byte value, and is it the one and only such character, defined by the basic execution character set?

There can be a null character in the source character set, too, though it is not required. And every extended character set embeds the corresponding basic character set, so in that sense every extended execution character set defines the null character, and extended source character sets may also do so. However, in every character set that includes the null character, that character is represented as a byte with value zero, and in every character set that contains a byte with value zero in any character representation, that byte represents the null character.
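The standard library's conversion functions reflect the same rule. A minimal sketch, assuming the default "C" locale that is in effect at program startup: mbrtowc() is given only one byte to look at, and it returns 0 exactly when that byte completes the multibyte character corresponding to the null wide character. The helper name starts_with_null_character is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the first byte of s converts to the
 * null wide character, 0 otherwise. Only a single byte is offered to
 * mbrtowc(), so a null character is found only if one byte suffices. */
int starts_with_null_character(const char *s)
{
    mbstate_t st;
    wchar_t wc = L'x';          /* sentinel, overwritten on success */
    size_t r;

    memset(&st, 0, sizeof st);  /* start in the initial conversion state */
    r = mbrtowc(&wc, s, 1, &st);

    /* mbrtowc() returns 0 when the bytes examined complete the
     * multibyte character for the null wide character. */
    return r == 0 && wc == L'\0';
}
```

With the empty string "" the helper returns 1, because the first and only byte is the terminator; with "a" it returns 0.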
