competing regular expressions (race condition)-CodePudding

I'm trying to use python PLY (lex/yacc) to parse a language called 'GRBL'. GRBL looks something like this:

G00 X0.0 Y0.0 Z-1.0
G01 X1.0
..

The 'G' Codes tell a machine to 'go' (or move) and the coordinates say where.

LEX requires us to specify a unique regular expression for every possible 'token'.

So in this case I need a regex that will clearly define 'G00' and one that will clearly define 'G01' etc.

Obviously one's first thought would be r'G00' etc. However G code is imprecise. The G can be upper or lower case, there can be leading zeros etc.

(g0, G00, g001 etc.)

So something for G00 may be as simple as:

r'[Gg]{1}0*'

And for G01 we could have

r'[Gg]{1}0*1'

But this does not work. G00 parses correctly, but G01 gives:

LexToken(G00,'G0',3,21)
Illegal character '1'

That is, lex thinks that G01 is a G0 token and doesn't know what to do with the '1'. Which is clearly some sort of greedy matching problem.

Unfortunately I can't use the "$" terminator to specify that the string must 'end' with a "1"

I realise this might seem simple to some, but I've been at this for 3 hours and can't get it to work! Does anyone know how to address this problem?

CodePudding user response：

Your pattern [Gg]{1}0* is to general and it matches both G00 and G01 https://regex101.com/r/gkz0Wb/1 And in this second case you are left with single character 1.

You will have to make this pattern more specific for example by adding whitespace character at the end pattern [Gg]{1}0*\s https://regex101.com/r/dsSFaG/1

CodePudding user response：

Note: There's pretty well no reason to write {1} in a regular expression. It means that the previous element should be repeated exactly once, which is what would have happened without the repetition operator. So all it does is to obfuscate the regular expression (and slow down matching).

But that's not your problem. Your problem is likely the order in which Ply applies the regular expressions. Ply creates a single massive Python regular expression by concatenating all patterns into a set of alternatives:

(pattern1)|(pattern2)|(pattern3)|...|(patternz)

The order in which the patterns are inserted is important because Python "regular" expressions use an ordered alternation operator (making them actually irregular in mathematical terms, but that's a side issue). So once some alternative matches, the following ones are not even tried.

The Ply manual defines the ordering:

All tokens defined by functions are added in the same order as they appear in the lexer file.

Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).

I'm guessing that you're using functions, so that the patterns are in order by appearance in the file, because your second pattern --which is longer-- would by applied first if they were defined as strings. But without seeing your actual file, it's very hard to know for sure.

In any case, conventional wisdom for Ply lexers is to use as few patterns as possible, preferring to map keywords to tokens with dictionaries. In the case of GRBL one possibility might be to use [Gg][0-9] (\.[0-9]÷)? as the pattern and then extract the index in the semantic action.