Background
I've made a couple of simple compilers before, but I've never properly addressed this issue:
Say I have a token LT which matches the expression <, and a token LTEQ which matches <=. An LT would match part of <= in this case, and I don't want this.
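For instance (a minimal JavaScript illustration of the overlap, not from the original post), matching each token's pattern on its own lets the shorter pattern fire inside the longer operator:
const line = "i <= n";
console.log(line.match(/<=/g)); // [ "<=" ]
console.log(line.match(/</g));  // [ "<" ] -- but this "<" is really part of "<="
Both patterns claim the same characters, so a per-token search produces a bogus LT token.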
Other potential examples are:
=== vs == vs =
=> vs > vs = vs >=
Before, what I've done is as follows:
- Order the regular expressions from most specific to least specific
- For each expression match, replace it in the line with an equal-length series of whitespace characters and record the token with its start index (replacing ||: "true || false" => "true    false")
- Sort the list of tokens by their start index
This is inefficient and only works in regex libraries that allow for a lambda to be passed in to process matches. Naturally I want to improve this.
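For reference, a rough sketch of that replace-and-record approach in JavaScript (the token names and helper are my own illustration, assuming a regex library whose replace accepts a callback, as JavaScript's String.prototype.replace does):
const tokenDefs = [
  { type: "LTEQ", regex: /<=/g }, // most specific first
  { type: "LT",   regex: /</g  },
];
function tokenizeByBlanking(line) {
  const tokens = [];
  for (const { type, regex } of tokenDefs) {
    // Record each match, then blank it out so less specific patterns
    // cannot match inside the same span.
    line = line.replace(regex, (match, index) => {
      tokens.push({ type, value: match, index });
      return " ".repeat(match.length);
    });
  }
  // Restore source order.
  return tokens.sort((a, b) => a.index - b.index);
}
console.log(tokenizeByBlanking("a <= b < c"));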
The problem
How do I tokenize a string with token definitions that overlap, as shown with LT and LTEQ?
Possible solutions I've thought of but not tried:
- Keeping a map of all used spans of characters and ignoring matches within these spans (sketched below)
- Only having very simple token definitions and creating more complicated patterns from the simple ones (LTEQ <= would be created from LT < and EQ =)
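A minimal sketch of the first idea (my own illustration, not tested in a real compiler): record which character positions have already been claimed and drop any later match that overlaps them:
function tokenizeWithSpanMap(line, tokenDefs) {
  const used = new Array(line.length).fill(false);
  const tokens = [];
  for (const { type, regex } of tokenDefs) { // most specific first
    for (const match of line.matchAll(regex)) {
      const start = match.index;
      const end = start + match[0].length;
      // Skip matches that overlap an already-claimed span.
      if (used.slice(start, end).some(Boolean)) continue;
      for (let i = start; i < end; i++) used[i] = true;
      tokens.push({ type, value: match[0], index: start });
    }
  }
  return tokens.sort((a, b) => a.index - b.index);
}
console.log(tokenizeWithSpanMap("a <= b < c", [
  { type: "LTEQ", regex: /<=/g },
  { type: "LT",   regex: /</g  },
]));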
CodePudding user response:
You should try to tokenise the input in a single sweep. You would create a "monster" regex that matches any token (without regard to context unless absolutely necessary -- that would come later), giving precedence to longer tokens over shorter ones.
Here is an example with a very basic regex just to demonstrate the idea. It is implemented in JavaScript, and it parses a tiny Java-like input:
let input = `
for (int i = 0; i < 1e+4; i++) {
    int k = i <= n / 2 ? 9.42 : 3.14;
    System.out.println(k);
}
`;
const regex = /[+-]?\d+(\.\d*)?(e[-+]?\d+)?|\w+|[<>=]=?|\+[+=]?|-[-=]?|\S/g;
const tokens = input.match(regex);
console.log(tokens);
The final \S is the "anything else" match, so that this is guaranteed to capture all non-space text. Following this basic tokenisation you'd then possibly create an abstract syntax tree from it.