The compiling process is preprocessing-compiling-assembling-linking. Where in the process that "A&

Time:10-15

#include <stdio.h>

int main(void) {
    printf("A");
}

Does a reserved keyword of the language, such as int, also follow the ASCII or Unicode character set? That is, is it split into individual characters that are then converted to binary?

CodePudding user response:

The character set of the source files is implementation-defined. Many common compilers use ASCII, some even Unicode, but the C standard requires no specific character set.

This is the first translation phase specified by the C standard:

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.

Section 5.2.1 of the C standard distinguishes between the source character set and the execution character set. Neither of them is defined as a specific character set.

Section 5.1.1.2 defines several translation phases. The fourth phase executes all preprocessing directives and thereby completes the preprocessing.

This is the fifth phase:

  5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

So your title question can be answered as "The compiler converts characters from the source character set to the execution character set."

There are indeed compilers whose two character sets differ. For example, z88dk uses a target-specific execution character set that is not necessarily ASCII, but it accepts ASCII source files.
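
As a rough illustration of this point, the following sketch checks whether a few character constants have their ASCII codes in the execution character set. The function names are hypothetical, and the choice of sample characters is arbitrary; it is only a spot check, not a full test of the character set.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if a few sample character constants
   have their ASCII codes in the execution character set, else 0.
   The compiler already converted each constant during translation. */
int exec_charset_looks_like_ascii(void) {
    return 'A' == 0x41 && 'a' == 0x61 && '0' == 0x30 && ' ' == 0x20;
}

/* Prints the numeric code the implementation chose for 'A'. */
void print_code_of_A(void) {
    printf("'A' has the value %d in the execution character set\n", 'A');
}
```

Called from main on an ASCII-based implementation, print_code_of_A() would report 65. On an implementation with an EBCDIC execution character set the same source would report 193, and the check function would return 0.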


However, this conversion takes place only for character constants and string literals.

Keywords are not affected. They are processed by the preprocessor in their source character set encoding and are never converted to any other character set. The source character set might be ASCII or UTF-8, but the specific character set is implementation-defined. The third translation phase only splits the source into tokens:

  3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). [...]

Inside the compiler, such a token could be represented by an integer constant, an enumeration member, or any other means the compiler's developers see fit.
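
A minimal sketch of that idea, assuming a tiny made-up set of token kinds (the enum, the table, and classify_word are hypothetical names, not taken from any real compiler):

```c
#include <string.h>

/* Hypothetical token kinds; a real compiler has many more. */
enum token_kind { TOK_IDENTIFIER, TOK_KW_INT, TOK_KW_VOID, TOK_KW_RETURN };

struct keyword {
    const char *spelling;
    enum token_kind kind;
};

/* Table of reserved words known to this sketch. */
static const struct keyword keywords[] = {
    { "int",    TOK_KW_INT },
    { "void",   TOK_KW_VOID },
    { "return", TOK_KW_RETURN },
};

/* Maps a word to its token kind; any word not in the table
   is treated as an ordinary identifier. */
enum token_kind classify_word(const char *word) {
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(word, keywords[i].spelling) == 0)
            return keywords[i].kind;
    }
    return TOK_IDENTIFIER;
}
```

Here classify_word("int") yields TOK_KW_INT, while classify_word("main") yields TOK_IDENTIFIER. After this step the compiler works with the enumeration value, and the characters of the keyword no longer matter.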


Concerning your mention of "binary": everything in a computer as we commonly know it is binary. It is always the interpretation of a binary value that makes it a number, a character, a machine instruction, or anything else. Therefore, there are fewer conversions to binary than you might think.

CodePudding user response:

Reserved words of programming languages are almost always composed of plain ASCII characters.
int is stored in the source file as the bytes 0x696E74, and the compiler has to parse it and identify it as one of the reserved words. This can be done by comparing those three bytes with the strings in a table of reserved words, or with a more advanced technique such as a hash lookup. Many languages have case-insensitive reserved words; in such a language the parsed word int would first have to be converted to a uniform case, for example INT.
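
The byte view and the case folding can be sketched as follows. Both function names are hypothetical, and the sketch assumes 8-bit bytes, an ASCII-compatible execution character set, and words short enough to fit into an unsigned long.

```c
#include <ctype.h>

/* Packs the bytes of a short word into one number, first byte in the
   most significant position. Assumes 8-bit bytes and a word that fits
   into an unsigned long. */
unsigned long pack_bytes(const char *word) {
    unsigned long value = 0;
    while (*word != '\0') {
        value = (value << 8) | (unsigned char)*word;
        word++;
    }
    return value;
}

/* Converts a word to upper case in place, as a compiler for a
   case-insensitive language might do before the table lookup. */
void fold_to_upper(char *word) {
    for (; *word != '\0'; word++)
        *word = (char)toupper((unsigned char)*word);
}
```

Under those assumptions pack_bytes("int") gives 0x696E74, and fold_to_upper turns a writable buffer holding "Int" into "INT". C itself is case-sensitive, so a C compiler would skip the folding step.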

The argument of printf() is the string literal "A". The compiler stores its value in the object file, typically in the .rodata section, as the byte sequence 0x41 0x00 (a zero-terminated string in ASCII or UTF-8 encoding). It makes no sense to speak of converting the ASCII value of the letter A to binary; it is only a question of interpreting the byte's contents as the letter A or as the number 0x41 = 65.
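
To make the stored bytes visible, here is a small sketch with two hypothetical helpers. It assumes nothing about the section the literal ends up in; it only looks at the bytes the program sees at run time.

```c
#include <stdio.h>
#include <stddef.h>

/* Size of the array created by the literal "A":
   one letter plus the terminating zero byte. */
size_t size_of_literal_A(void) {
    return sizeof "A";
}

/* Prints every byte of a string, including the terminating zero,
   once as a hexadecimal and once as a decimal number. */
void dump_bytes(const char *text) {
    size_t i = 0;
    do {
        unsigned char byte = (unsigned char)text[i];
        printf("byte %lu: hex 0x%02X, decimal %u\n",
               (unsigned long)i, (unsigned)byte, (unsigned)byte);
    } while (text[i++] != '\0');
}
```

size_of_literal_A() returns 2. On an ASCII-based implementation, dump_bytes("A") prints two lines, the first for the byte 0x41 (decimal 65) and the second for the terminating 0x00. The same byte is shown as two different notations, which is interpretation and not conversion.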
