Home > Blockchain >  Transforming grammar for predictive parsing
Transforming grammar for predictive parsing

Time:04-22

Is the following grammar suitable for predictive parsing, or is their an algorithm to modify the grammar to make it suitable for predictive parsing?

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'

where * means zero or more, and | divides the different choices.

I have written a backtracking parser that works fine on the above grammar, however I've read that modern parsers are mostly predictive parsers these days and backtracking parsers are rarely used. Backtracking parsers need to rewind the state of the parser as they backtrack which makes them less performant than predictive parsers.

Transforming the grammar above into:

number = digit (sep* digit )*
sep = '_'
digit = '0'..'9'

where means one or more.

Will make predictive parsing work because it avoids digit_or_sep* from consuming too many tokens before the final digit. But I am not sure if there is an algorithmic way to auto-transform the grammar to make it work.

Edit: I had a read about left factoring on google, it could be the missing piece of the puzzle I need to understand.

After a bit of left refactoring:

number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'
let g = digit_or_sep g | empty
number = digit g digit | digit
let h = g digit | empty
number = digit h
h = g digit | empty
g = digit_or_sep g | empty
g = digit g | sep g | empty
h = (digit g | sep g | empty) digit | empty
h = digit g digit | sep g digit | digit | empty
h = digit h | sep g digit | empty

The following grammar is produced:

number = digit h
h = digit h | sep g digit | empty
g = digit g | sep g | empty

Which should be suitable for a predictive parser. But I have still not come up with an algorithm to do it yet automatically.

Edit: Throwing in more details of what I am trying to do. The grammar above is just an small example of grammar I'd like to transform.

I am actually parsing Kotlin: https://github.com/clinuxrulz/parse-bolt/blob/main/src/kotlin/parser.rs

And it is working fine with the backtracking parser, but it is having performance issues. Parsing a simple function type signature "(a: String, b: String) -> String" took 20ms to parse which is way too slow for such a small input. There are some optimisations I can do in the code that I know off, but it was way slower than I expected.

On the other hand for predictive parsing I can not simply use their grammar as is. It seems it will need some manual changes to the grammar first. ANTLR must be doing some transformations on their grammar 1st before generating the parser.

Down the bottom of this grammar file for Kotlin is the same the number example above: https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinLexer.g4

I've read ANTLR can parse the full Java standard library in under 1 second. I might have a long way to go.

CodePudding user response:

I'm going with rici's comment that there is no algorithm to guarantee full left factoring of any grammar. However there are heuristic search methods.

I'm gonna stick to PEG and modify my own grammars when I need the performance. And optimise my backtracking parsing combinator as much as possible as it can handle more types of grammars and might still be useful.

  • Related