I recently started learning the JavaCC parser generator. I was asked to write a parser in which one token takes from one to many digits and another token takes from two to many digits. I came up with something like this:
TOKEN : {
<NUM1: <NUM> (<NUM>)* > //for one or more
<NUM2: (<NUM>) > //for two or more
<NUM :(["0"-"9"]) >
// and in the function
void calc():
{}
{
(
(<NUM1>)
(<NUM2>)
)* <EOF>
}
However, even if I pass a text value with no numbers, it is parsed successfully. What am I doing wrong?
CodePudding user response:
The JavaCC syntax for lexical tokens lets you repeat an element enclosed in parentheses ()
by following it with one of:
? - zero or one time
* - zero or more times
+ - one or more times
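As a quick way to experiment with these three operators outside of JavaCC, the sketch below uses java.util.regex, whose ?, * and + quantifiers carry the same meaning. This is only an analogy: the class name RepetitionDemo and the helper matches are made up for illustration, and JavaCC writes the character class as ["0"-"9"] rather than [0-9].

```java
import java.util.regex.Pattern;

// Hypothetical helper class; java.util.regex stands in for JavaCC's token syntax here.
final class RepetitionDemo {

    // True only if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("([0-9])?", ""));    // true:  zero digits allowed
        System.out.println(matches("([0-9])?", "12"));  // false: at most one digit
        System.out.println(matches("([0-9])*", ""));    // true:  zero digits allowed
        System.out.println(matches("([0-9])*", "123")); // true:  any number of digits
        System.out.println(matches("([0-9])+", ""));    // false: at least one digit needed
        System.out.println(matches("([0-9])+", "123")); // true
    }
}
```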
In your case, you need two tokens:
TOKEN:
{
<NUM2: ["0"-"9"] (["0"-"9"])+ > // for two or more
| <NUM1: (["0"-"9"])+ > // for one or more
}
You read that as:
- token named NUM1 matches from one to infinity number of digits
- token named NUM2 matches from two to infinity number of digits
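The lexical machinery in JavaCC consumes the input character stream one character at a time and attempts to recognize a token. Each token has its own automaton: the one for NUM1 reaches an accepting state after the first digit and stays there for every further digit, while the one for NUM2 reaches an accepting state only after the second digit.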
The lexer advances through the automata of both tokens simultaneously. When no further progress is possible, the longest match found is recognized. If more than one token type matches that same text, the one declared first is recognized. For this reason NUM2 is declared before NUM1. This means that for input 1 the token NUM2 is not going to be recognized, because it needs more than one digit. In this case NUM1 is the only token type that matches the input. For input 12 both token types accept it, but NUM2 is recognized because it is declared first. Conversely, if you declare NUM1 first and NUM2 second, you will never receive a NUM2 token, because NUM1 will always "win" with its higher precedence.
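The rule described above (longest match, ties broken by declaration order) can be sketched in plain Java. This is a hypothetical simulation, not the code JavaCC generates: the class LongestMatchDemo and the helpers declare and recognize are invented for illustration, and java.util.regex patterns stand in for the two token definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of the matching rule; not JavaCC's generated lexer.
final class LongestMatchDemo {

    // Builds a token table from (name, regex) pairs.
    // LinkedHashMap keeps insertion order, which plays the role of declaration order.
    static Map<String, Pattern> declare(String... namesAndRegexes) {
        Map<String, Pattern> tokens = new LinkedHashMap<>();
        for (int i = 0; i < namesAndRegexes.length; i += 2) {
            tokens.put(namesAndRegexes[i], Pattern.compile(namesAndRegexes[i + 1]));
        }
        return tokens;
    }

    // Returns the name of the token recognized at the start of the input,
    // or null if no token matches there.
    static String recognize(Map<String, Pattern> tokens, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> token : tokens.entrySet()) {
            Matcher m = token.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The comparison is strict, so on a tie the earlier declaration is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = token.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Pattern> num2First = declare("NUM2", "[0-9][0-9]+", "NUM1", "[0-9]+");
        Map<String, Pattern> num1First = declare("NUM1", "[0-9]+", "NUM2", "[0-9][0-9]+");

        System.out.println(recognize(num2First, "1"));  // NUM1: NUM2 needs two digits
        System.out.println(recognize(num2First, "12")); // NUM2: tie, declared first
        System.out.println(recognize(num1First, "12")); // NUM1: tie, declared first
    }
}
```

With NUM2 declared first, any run of two or more digits comes out as NUM2, so NUM1 is only ever produced for a single digit.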
To use them, you can have two parser functions like these:
void match_one_to_many_numbers() : {} { <NUM1> (" " <NUM1>)* <EOF> }
void match_two_to_many_numbers() : {} { <NUM2> (" " <NUM2>)* <EOF> }
You read that as:
- function match_one_to_many_numbers accepts one or more NUM1 tokens, delimited by a space, and then the input stream has to end with the EOF system token
- function match_two_to_many_numbers does the same, just with NUM2 tokens
Because both tokens accept an unbounded number of digits, you cannot have a sequence of these tokens without a delimiter that is not a digit.
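To see why the delimiter matters, here is a small hand-written tokenizer that mimics the two token definitions. It is a sketch under assumptions, not the lexer JavaCC generates: the class DelimiterDemo, the method tokenize and the SPACE label are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer mimicking NUM1 and NUM2 (NUM2 declared first).
final class DelimiterDemo {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Splits the input into token descriptions such as "NUM2(12)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ') {
                tokens.add("SPACE");
                pos++;
            } else if (isDigit(c)) {
                int start = pos;
                // Digits are consumed greedily: the longest run is always taken.
                while (pos < input.length() && isDigit(input.charAt(pos))) {
                    pos++;
                }
                String digits = input.substring(start, pos);
                // Two or more digits match both tokens; NUM2 wins because it is declared first.
                String kind = digits.length() >= 2 ? "NUM2" : "NUM1";
                tokens.add(kind + "(" + digits + ")");
            } else {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1234"));  // [NUM2(1234)]
        System.out.println(tokenize("12 34")); // [NUM2(12), SPACE, NUM2(34)]
    }
}
```

Without the space, 1234 can only ever be one token, because nothing tells the lexer where the first number should stop.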