In the reference section of the manual the word production is mentioned:
This manual describes the C language specified by the draft submitted to ANSI on 31 October, 1988, for approval as ``American Standard for Information Systems - programming Language C, X3.159-1989.'' The manual is an interpretation of the proposed standard, not the standard itself, although care has been taken to make it a reliable guide to the language. For the most part, this document follows the broad outline of the standard, which in turn follows that of the first edition of this book, although the organization differs in detail. Except for renaming a few productions, and not formalizing the definitions of the lexical tokens or the preprocessor, the grammar given here for the language proper is equivalent to that of the standard.
I can't however seem to make out what it actually means. My guess is that it refers to the value of a key in the grammar section:
translation-unit: external-declaration translation-unit external-declaration
The following piece should hint at what the writer means but I can't seem to figure it out by myself as I miss some essential terminology:
The grammar has undefined terminal symbols integer-constant, character-constant, floating- constant, identifier, string, and enumeration-constant; the typewriter style words and symbols are terminals given literally. This grammar can be transformed mechanically into input acceptable for an automatic parser-generator. Besides adding whatever syntactic marking is used to indicate alternatives in productions, it is necessary to expand the ``one of'' constructions, and (depending on the rules of the parser-generator) to duplicate each production with an opt symbol, once with the symbol and once without. With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator. It has only one conflict, generated by the if-else ambiguity.
I don't know either what the embolden words/sentences mean. Can someone give me an insight?
CodePudding user response:
In computer science, formal grammars are described using production rules. For example, a rule of the form:
P → xyz
says the symbol P may be replaced by the symbols xyz, and a rule of the form
Q → aP | bP
says the symbol Q may be replaced by the symbol a and the symbol P or by the symbol b and the symbol P.
The formal grammar designates a start symbol, such as Q, and says which symbols are nonterminal symbols (placeholders that do not appear in the final string) and which are terminal symbols (that may appear in the final string). If Q is the start symbol for the grammar with the two rules above, then these rules describe a grammar that contains two strings, axyz and bxyz, because those are the only two strings that can be made with those rules. The set of strings a grammar can produce is called its language.
λ is commonly used the mean an empty string, one with no symbols.
This grammar with nonterminal symbol S, terminal symbols a and b, and start symbol S:
S → aSbb | λ
can produce the empty string by replacing S with the empty string, or it can produce abb by replacing S with aSbb and then the new S with the empty string, or it can produce aabbbb by replacing S with aSbb and then replacing the new S with aSbb, yielding aaSbbbb, and then replacing S with the empty string, yielding aabbbb. We can see the language of this grammar is the set of all strings containing some number of a characters (possibly zero) followed by twice as many b characters.
The C standard defines the C language using a formal grammar (with some qualifications and modifications in English). For example, it contains this production (where it uses “:” instead of the “→” I used above and uses separate lines instead of “|” to indicate a choice):
statement:
labeled-statement
compound-statement
expression-statement
selection-statement
iteration-statement
jump-statement
Regarding the other terminology you marked in bold:
undefined terminal symbols integer-constant,…
The integer-constant symbol is a defined non-terminal in the official ANSI C 1990 standard. Referring to it as an undefined terminal symbol suggests it was not defined in the submitted draft. I do not have a copy of the draft to check.
terminals given literally
I am not sure what this means. Terminal symbols generally are literal; a production that is said to produce e actually produces the character “e”, with the exception that we often use “λ” to mean the empty string.
to duplicate each production with an opt symbol, once with the symbol and once without
The standard uses a subscript “opt”, like this:
P: Q Ropt
to mean that R is optional. Formally, this is actually two rules:
P → Q R
P → Q
meaning there is a rule that replaces P with Q R and a rule that replaces P with Q, and either may be used. Hence, the R is optional. Using the “opt” subscript is just a different way of writing the rules.
… this grammar is acceptable to the YACC parser-generator. It has only one conflict, generated by the if-else ambiguity.
In computer science theory, a parser is software that checks whether a string is in a language. (In practice, it often does other useful things at the same time, like assigning “meaning” to the string by interpreting parts of it as numbers, parts as variables, and so on.) YACC is software that reads a specification of a grammar and produces source code for a parser for the language the grammar describes.
A conflict occurs when one string can be produced in two different ways by the grammar. In the C formal grammar, if (x) if (y) S1 else S2
can be produced in two ways, one that associates else S2
with the first if
and one that associates else S2
with the second if
. This has to be resolved by adding information to the grammar; YACC is told to associate the else S2
with the most recent if
that does not currently have an else
.
CodePudding user response:
A "symbol" is an element of a grammar.
E.g. in the section you quoted:
translation-unit:
external-declaration translation-unit external-declaration
All four are symbols.
A "production" explains what a symbol consists of. Your quote is a single production that states that a translation-unit consists of three things in order: external-declaration, then translation-unit, then external-declaration.
A "terminal" symbol is a symbol that doesn't have any other symbols in its definition.
E.g. your translation-unit
is not terminal, but this one is:
digit: one of
0
1
2
3
4
5
6
7
8
9
It's moot if the individual digits are terminals, or if the whole "digit" is terminal; probably the former.
Those are also "given literally", meaning the digits are actually spelled in the standard.
And here's a terminal not given literally:
basic-c-char:
any member of the translation character set except the U 0027 apostrophe, U 005c reverse solidus, or new-line character
Notice how they didn't spell every possible character, since there are too many of them, and instead provided a plain text description.
An "undefined" terminal is just a terminal that wasn't explained/defined, making the grammar technically incomplete. If they call "integer-constant" an undefined terminal, it means their grammar fails to include integer-constant: blah blah
.
to duplicate each production with an opt symbol, once with the symbol and once without
Consider:
foo:
x
opty
Here foo is either x
or x y
.
They're suggesting an alternative way to spell the same thing, without using "opt":
foo:
y
x
y
It has only one conflict, generated by the if-else ambiguity.
Parsing a program is done by starting with a single symbol, and figuring out which productions transform it to the text you're trying to parse.
If there's more than one sequence of productions that does it, the grammar is "ambiguous".