Home > OS >  Differences in the two different parsing methodologies
Differences in the two different parsing methodologies

Time:08-21

I am quite new to Antlr4, and am interested in simplifying a large grammar that I have. There are two possible approaches that I may take. The first one, is more of the "in-line" approach, which often makes things simpler as I run into less left-recursion errors (and everything is 'there in one place' so it it's a bit easier to track. Here is an example parse with it:

// inline
// input: CASE WHEN 1 THEN 1 WHEN 1 THEN 1 ELSE 1 END
grammar DBParser;

statement: expr EOF;
expr
    : 'CASE' expr? ('WHEN' expr 'THEN' expr)  ('ELSE' expr)? 'END'  # caseExpressionInline
    | ATOM                                                          # constantExpression
    ;

ATOM: [a-z]  | [0-9] ;
WHITESPACE: [ \t\r\n] -> skip;

enter image description here

The second approach is to break everything out into separate small components, for example:

// separated
grammar DBParser;

statement: expr EOF;
expr
    : case_expr     # caseExpression
    | ATOM          # constantExpression
    ;

case_expr
   : 'CASE' case_arg? when_clause_list case_default? 'END'
   ;

when_clause_list
   : when_clause 
   ;

when_clause
   : 'WHEN' expr 'THEN' expr
   ;

case_default
   : 'ELSE' expr
   ;

case_arg
   : expr
   ;

ATOM: [a-z]  | [0-9] ;
WHITESPACE: [ \t\r\n] -> skip;

enter image description here

A couple high-level things I've noticed are:

  1. The inline version comes out to about half the size of the output code files of the second approach.
  2. The in-line version is about 20% faster! I tried both parsers on a 10MB file of repetitions of the same statement, and the inline version never took over 1s and the separated version took around 1.2s and never went under 1.1s (I ran each about 20x). That seems crazy to me that just in-lining the expressions would have such a large performance difference! That was certainly not my goal but noticed it as I was comparing the two different things in writing up this question.

What would be the preferable way to split things up then?

CodePudding user response:

(With the caveat that SO frowns upon opinion-based questions…)

I think that you should let the programming experience of coding against the generated classes be the most important guidance on how much to break things out.

I find the first style much easier to deal with, and you’ve already pointed out performance benefits.

Like most guidance it can be taken too far in either direction. The second is pretty extreme in the “breaking things out” category and, in my opinion, going to be a pain to deal with the deeply nested structure. I think it appeals to our desire to “keep functions small” in general purpose languages, but you’ll spend a lot more time writing the rest of your code than you will getting the parser right.

Your first example comes close to going to the other extreme. I don’t think it quite crosses the line, but if it were to start getting out of hand, I might opt for something like:

grammar DBParser;

statement: expr EOF;
expr
    : case_expr     # caseExpression
    | ATOM          # constantExpression
    ;

case_expr
   : 'CASE' expr? ('WHEN' expr 'THEN' expr)  ('ELSE' expr)? 'END'
   ;

I’ve also found that I go back and rework the grammar sometimes as I’m fleshing out the language and find I want a slightly different structure. For example, I found that, when doing code completion, it was sometimes helpful to have different parser rules for different contexts that just required an ID token. The code completion could then tell me that I could use an assigment_dest instead of an `IDK and I could do much better completion, so, that’s a case where an extra level proved useful once I started using the grammar in practice.

To me, refining the grammar to give me the most useful ParseTree is the first priority.

  • Related