I am constructing an LL(1) parser (a recursive descent parser) and I need to parse the sentence "a = 3". I have two procedures to match that rule: parse_assignment and parse_binary_operator. Each function consumes the first token and then needs to see whether the operator ("=") matches its rule, so I first call parse_assignment, which consumes the token "a", then consumes "=", then consumes "3". If the function fails (doesn't match), I try to match the next rule.
But what if I want to parse the sentence "a - 1"? I call parse_assignment, which consumes "a" and then fails, and then I try to call parse_binary_operator, but the token "a" was already consumed, so that function only has "-1" left to parse.
So I need some way to return the token to the stream when a function doesn't match, but pushing the token back into the stream doesn't seem like a good idea to me; I guess there are more robust techniques for this.
Answer:
The "(1)" in "LL(1)" indicates that the parser is allowed to consult (1) unconsumed token in order to decide which action to take. This token is usually called the "lookahead" token, because the parser can look ahead at it without actually consuming it. Whatever interface you implement between your parser and your lexer must somehow implement this feature; otherwise, you will run into precisely the problem you mention.
The precise nature of this interface will depend on which language you are using to write your parser, so it's hard to give a concrete answer. But basically, the lexical analyser will have two fundamental methods:
- peek, which returns the token type of the lookahead token. (A token usually has a number of additional features, all of which should be available in some way through peek or through other methods that are also views of the lookahead token. These include the semantic value of the token, if any, and some indication of where the token is in the source file.)
- consume, which discards the lookahead token and replaces it with the next input token.
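For concreteness, here is a minimal sketch of such an interface in Python; the Lexer class, the Token tuple and the token patterns are all invented for this illustration, just enough to tokenise sentences like "a = 3" or "a - 1":

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["type", "value", "position"])

class Lexer:
    # Invented token patterns: numbers, names, and a few single-character operators.
    TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[=+\-*/]))")

    def __init__(self, text):
        self.text = text.rstrip()
        self.pos = 0
        self.lookahead = None
        self._advance()              # fill the lookahead slot

    def _advance(self):
        """Scan the next token into the lookahead slot (an EOF token at end of input)."""
        if self.pos >= len(self.text):
            self.lookahead = Token("EOF", None, self.pos)
            return
        m = self.TOKEN_RE.match(self.text, self.pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {self.pos}")
        self.pos = m.end()
        kind = m.lastgroup
        self.lookahead = Token(kind, m.group(kind), m.start(kind))

    def peek(self):
        """Return the lookahead token without consuming it."""
        return self.lookahead

    def consume(self):
        """Discard the lookahead token, replace it with the next input token,
        and return the discarded token to the caller."""
        token = self.lookahead
        self._advance()
        return token
```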
There are other possible interfaces, which are essentially equivalent. For example, it's common to implement a boolean convenience function like "match an X and consume the token on success", although that might not be so convenient if you also need to return the semantic value of the matched token.
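In terms of the hypothetical Lexer sketched above, such a helper might look like this:

```python
def match(lexer, token_type):
    """If the lookahead token has the given type, consume it and return True;
    otherwise leave the lookahead token alone and return False."""
    if lexer.peek().type == token_type:
        lexer.consume()
        return True
    return False
```

A common variant returns the matched token (or None) instead of a boolean, which sidesteps the semantic-value problem mentioned above.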
The lookahead token is part of the state of the lexical analyser, which also includes information like the string currently being parsed and the current input position in that string. It was once common to store this state in global variables (which is what the lexical analysers built with the assistance of (f)lex do), but this makes it much harder to write a program with two different lexical analysers. Modern programming style discourages the use of global variables, which means you either include the lexical analyser's state (or object) in the parser state (or object), or explicitly pass a reference to the lexical analyser through the recursive parsing calls.
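Here, for example, the lexer is passed explicitly to each parsing function (again using the invented Lexer sketch above; parse_statement and this two-argument form of parse_binary_operator are likewise made up, and both "=" and "-" come out of that lexer as OP tokens). The point is that the common prefix, the name, is parsed once, and the lookahead operator then decides which rule applies, so nothing ever has to be pushed back:

```python
def parse_statement(lexer):
    # statement -> NAME "=" NUMBER  |  NAME operator NUMBER
    # Both alternatives start with NAME, so parse it first and then
    # peek at the operator to choose between them.
    left = lexer.consume()                      # the NAME ("a")
    op = lexer.peek()
    if op.type == "OP" and op.value == "=":
        lexer.consume()                         # the "="
        value = lexer.consume()                 # the NUMBER ("3")
        return ("assign", left.value, value.value)
    return parse_binary_operator(lexer, left)

def parse_binary_operator(lexer, left):
    # Binary expression whose left operand was already consumed by the caller.
    op = lexer.consume()                        # e.g. "-"
    right = lexer.consume()                     # the NUMBER ("1")
    return ("binop", op.value, left.value, right.value)

print(parse_statement(Lexer("a = 3")))   # ('assign', 'a', '3')
print(parse_statement(Lexer("a - 1")))   # ('binop', '-', 'a', '1')
```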