Home > Software design >  Avoiding overlap with similar regex patterns during tokenization
Avoiding overlap with similar regex patterns during tokenization

Time:10-17

Background

I've made a couple simple compilers before, but I've never properly addressed this issue:

Say I have a token LT which searches the expression < and a token LTEQ which searches <=. A LT would match part of <= in this case, and I don't want this.

Other potential examples are:

  • === vs == vs =
  • => vs > vs = vs >=

Before, what I've done is as follows:

  • Order the regular expressions from most specific to least specific
  • For each expression match, replace it in the line with an equal-length series of whitespace characters and record the token with its start index (Replacing ||: "true || false" => "true false")
  • Sort the list of tokens by their start index

This is inefficient and only works in regex libraries that allow for a lambda to be passed in to process matches. Naturally I want to improve this.

The problem

How do I tokenize a string with tokens definitions that overlap as show with LT and LTEQ?

Possible solutions I've thought of but not tried:

Keeping a map of all used spans of characters and ignore matches within these spans Only having very simple token definitions and creating more complicated patterns from the simple ones

  • LTEQ <= would be created from LT < and EQ =

CodePudding user response:

You should try to tokenise the input in one single sweep. You would create a "monster" regex that would match any token (without regarding context unless absolutely necessary -- that would come later), giving precedence to larger tokens over smaller ones.

Here is an example with a very basic regex just to demonstrate the idea. It is implemented in JavaScript, and it parses a tiny Java-like input:

let input = `
   for (int i = 0; i < 1e 4; i  ) {
       int k = i <= n / 2 ? 9.42 : 3.14;
       System.out.println(k);
   }
`;

const regex = /[ -]?\d (\.\d*)?(e[- ]?\d )?|\w |[<>=]=?|\ [ =]?|-[-=]?|\S/g;
const tokens = input.match(regex);
console.log(tokens);

The final \S is the "anything else" match, so that this is guaranteed to capture all non-space text. Following this basic tokenisation you'd then possibly create an abstract syntax tree from it.

  • Related