Why does the C standard state that string literals shall begin and end in the initial shift state?

Time:06-02

The ANSI X3.159-1989 "Programming Language C" standard states in the chapter "5.2.1.2 - Multibyte characters" that:

For the source character set, the following shall hold:

  • A comment, string literal, character constant, or header name shall begin and end in the initial shift state.

  1. Does this mean that a string literal, etc. shall begin and end with a character that is represented in the initial shift state, i.e. a single-byte value? Or does it mean that the environment shall reset its current shift state to the initial shift state before and after processing a given string literal, etc.?

  2. Why so? I.e., what is the purpose of requiring the initial shift state, especially at the end of a string literal, etc.?

Answer:

Why does the C standard state that string literals shall begin and end in the initial shift state?

Let's first see what exactly (or as exactly as the specification gets) is meant by "shift state":

A multibyte character may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other implementation-defined shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

Requiring string literals to begin and end in the initial shift state makes string semantics a lot simpler and more predictable. If you concatenate two strings, or output them one after the other, you can be confident that their juxtaposition does not change the meaning of the second one. If the first could end in a shift state other than the initial one, that would not be guaranteed, because the bytes of the second string would then be interpreted in whatever state the first one left behind.
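To make the concatenation problem concrete, here is a minimal sketch built on a made-up toy encoding. Everything in it is an assumption for illustration: the names `toy_shift` and `toy_decode` are hypothetical, and the rule that byte 0x0E enters a "shifted" state in which lowercase letters stand for uppercase ones is invented here, not taken from any real character set or from the standard.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy encoding, invented for illustration only:
 *   byte 0x0E enters a "shifted" state, byte 0x0F returns to the initial
 *   state. In the initial state every byte means itself. In the shifted
 *   state, lowercase ASCII letters stand for their uppercase forms. */
enum toy_shift { TOY_INITIAL, TOY_SHIFTED };

/* Decode the null-terminated byte string `in` into `out` (capacity `cap`,
 * terminator included). Decoding starts in *state and the shift state
 * reached at the end of `in` is stored back into *state.
 * Returns the number of characters written, not counting the terminator. */
size_t toy_decode(const char *in, char *out, size_t cap, enum toy_shift *state)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && state != NULL && cap > 0);

    for (; *in != '\0' && n + 1 < cap; ++in) {
        unsigned char b = (unsigned char)*in;

        if (b == 0x0E) {
            *state = TOY_SHIFTED;          /* shift sequence: no output */
        } else if (b == 0x0F) {
            *state = TOY_INITIAL;          /* shift sequence: no output */
        } else if (*state == TOY_SHIFTED) {
            out[n++] = (char)toupper(b);   /* meaning depends on the state */
        } else {
            out[n++] = (char)b;
        }
    }
    out[n] = '\0';
    return n;
}
```

Decoding `"ab\x0E" "cd"` from the initial state gives `abCD` and leaves the decoder in the shifted state. If `"ef"` is then decoded with that carried-over state it comes out as `EF`, whereas on its own it decodes as `ef`. Appending 0x0F to the first string restores the initial state and the second string keeps its own meaning. (The literal is split into two pieces only so that the `\x0E` escape does not swallow the following hex digits.)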

The inherent assumption underlying all this is that language-level semantics are ignorant of the details of any particular character encoding. They treat all strings as black boxes of bytes, terminated by a null character.

  1. Does this mean that a string literal, etc. shall begin and end with a character that is represented in the initial shift state, i.e. a single-byte value? Or does it mean that the environment shall reset its current shift state to the initial shift state before and after processing a given string literal, etc.?

Neither. With a state-dependent encoding, the current shift state is a running property of an interpretation of an encoded character sequence. Characters do not necessarily encode shift states directly, but the encoding scheme provides a way to specify shift state changes.

Details can vary with the particular encoding scheme, but encoded characters, whether single- or multibyte, are not generally in a particular shift state inherently. The whole point of such an encoding is that the same subsequence may be interpreted differently depending on the shift state. Thus, starting in the initial shift state is an assertion about how multibyte character sequences will be interpreted, and only by implication a statement about what a string literal must contain.

Ending in the initial shift state, on the other hand, is a constraint on the contents of the string literal, comment, character constant, or header name. A C source file is malformed if the bytes of such an element, interpreted as starting in the initial shift state, encode one or more state shifts such that the shift state at the end of the byte sequence differs from the initial shift state. The purpose is precisely to relieve the implementation of having to be concerned with encoding issues, and absolutely not to require it to perform any kind of shift-state cleanup.
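As a rough illustration of that constraint, the following sketch checks whether a byte sequence ends in the initial shift state. It assumes a simplified subset of ISO-2022-JP in which only two escape sequences are recognized: `ESC $ B` (switch to two-byte JIS X 0208) and `ESC ( B` (switch back to ASCII, the initial state). The names `jp_shift` and `ends_in_initial_state` are hypothetical, and this is not how any particular compiler validates its source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Shift states for a simplified subset of ISO-2022-JP (an assumption of
 * this sketch; the real encoding defines more escape sequences). */
enum jp_shift { JP_ASCII, JP_JISX0208 };

/* Interpret the `len` bytes at `s` starting in the initial (ASCII) state
 * and report whether the sequence also ends in that state. */
bool ends_in_initial_state(const char *s, size_t len)
{
    enum jp_shift st = JP_ASCII;
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        if (s[i] == '\x1B' && len - i >= 3) {
            if (s[i + 1] == '(' && s[i + 2] == 'B') {
                st = JP_ASCII;             /* ESC ( B : back to ASCII */
                i += 3;
                continue;
            }
            if (s[i + 1] == '$' && s[i + 2] == 'B') {
                st = JP_JISX0208;          /* ESC $ B : two-byte mode */
                i += 3;
                continue;
            }
        }
        /* Ordinary character: two bytes in JIS X 0208 mode, one in ASCII. */
        i += (st == JP_JISX0208 && len - i >= 2) ? 2 : 1;
    }
    return st == JP_ASCII;
}
```

Under these assumptions, `"abc"` passes, `"\x1B$B0!"` fails because it switches to the two-byte state and never switches back, and `"\x1B$B0!\x1B(B"` passes because the closing escape sequence returns to the initial state.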

  2. Why so? I.e., what is the purpose of requiring the initial shift state, especially at the end of a string literal, etc.?

It simplifies the language and improves the maintainability of C source files written in state-dependent source encodings. Each unit that accepts user-defined free(ish) text is modular -- it has the same, well-defined meaning regardless of the surrounding context, and moving, copying, or deleting such units cannot change the lexical interpretation of the surrounding tokens.
