I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case a direct application it would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε
simplify to selectStmt': (setOp selectStmt) | empty
? And how is the ε
removed here? I assume (setOp selectStmt) | empty
simplifies to (setOp selectStmt)?
because if it can be empty than it just means the optional ?
.
CodePudding user response:
Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1 2 3 4
is syntactically ambiguous, with five different parse trees. ((1 (2 (3 4)))
, (1 ((2 3) 4))
, ((1 2) (3 4))
, ((1 (2 3)) 4)
, (((1 2) 3) 4)
). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as -
or /
, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr ' ' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that
and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr ' ' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1 (2 (3 4)))
, then the unambiguous grammar would be right-recursive:
expr: term ' ' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we chose (as long as *
binds more tightly than
), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr
term
. Note that term
does not derive 2 3
; term
only allows multiplicative operators. So 1 2 3
can only be produced by reducing 1 2
to expr
and 3
to term
, leaving expr ' ' term
, which matches the production for expr
. Consequently, ((1 2) 3)
is the only possible parse. (1 (2 3))
could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr ' ' term
, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt
, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt
. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
By back-substituting selectStmt
into the first production of selectStmt'
, we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3
is parsed as (S1 (UNION S2 (UNION S3 ())))
.
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt
. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt
. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setop selectStmt | ε)
, exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3)
and the other one as (S1 UNION (S2 UNION S3))
. With UNION
, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE
operator, for example.
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt'
represents zero or more repetitions of setOp simpleSelectStmt
. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene *
operator (zero or more repetitions), we could write selectStmt'
as (setOp simpleSelectStmt)*
, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp
binds to the left or to the right. So the convenience does come at a small price.)