Including comments in AST

Time:10-02

I'm planning to write a parser for some language. I'm fairly confident that I could cobble one together in Parsec without too much hassle, but I would like to include comments in the AST so that I can eventually implement a code formatter.

At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having

data Expr = Add Expr Expr | ...

one would have

data Expr a = Add a (Expr a) (Expr a) | ...

and use a for whatever annotation (e.g. for comments that come after the expression).
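
To make the parameterized version concrete, here is a minimal sketch. The Lit constructor, the Comments newtype, and the helpers annotation and stripAnn are hypothetical names introduced only for illustration; they are not from the linked answer. It assumes GHC with the DeriveFunctor extension.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hypothetical annotated AST: every node carries an annotation of type `a`.
data Expr a
  = Lit a Int
  | Add a (Expr a) (Expr a)
  deriving (Show, Eq, Functor)

-- Assumed annotation type: the comments that come after the node.
newtype Comments = Comments [String]
  deriving (Show, Eq)

-- Read the annotation stored at the root of an expression.
annotation :: Expr a -> a
annotation (Lit a _)   = a
annotation (Add a _ _) = a

-- Drop every annotation, recovering the plain tree shape.
stripAnn :: Expr a -> Expr ()
stripAnn = fmap (const ())

-- 1 + 2, with trailing comments on the sum and on the second operand.
example :: Expr Comments
example =
  Add (Comments ["sum of two literals"])
      (Lit (Comments []) 1)
      (Lit (Comments ["the second operand"]) 2)
```

Deriving Functor means later passes that do not care about comments can work on Expr () via stripAnn, while the formatter keeps Expr Comments.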

However, some cases are less pleasant. The language features C-like comments (// ..., /* ... */) and a simple for loop like this:

for (i in 1:10)
{
   ... // list of statements
}

Now, excluding the body there are at least 10 places where one could put one (or more) comments:

/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/ 
{ /*K*/
...

In other words, while the for loop could previously be represented comfortably as an identifier (i), two expressions (1 and 10) and a list of statements (the body), we would now have to include at least 10 more parameters or record fields for annotations. This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person who wants to write a code formatter that preserves comments, so there must be a decent solution. Or is writing a formatter just that messy?

CodePudding user response:

You can probably capture most of those positions with just two generic comment productions:

Expr -> Comment Expr
Stmt -> Comment Stmt

This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):

Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt

In other words: one before each literal string but the first. Seems not too onerous, ultimately.
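
On the AST side, the two generic productions could become one wrapper constructor each, so that only the for node needs dedicated comment slots. The following is a rough sketch under some assumptions of mine: all type and function names are hypothetical, the slots are lists so that a position may hold zero or several comments, Range gets its own slot for the "hidden" G position, and K is left out because the body is elided in the question.

```haskell
type Comment = String

data Expr
  = Var String
  | Num Int
  | CommentedExpr Comment Expr   -- Expr -> Comment Expr
  deriving (Show, Eq)

-- Assumed shape of Range: lower bound, comments before ":", upper bound.
data Range = Range Expr [Comment] Expr
  deriving (Show, Eq)

data Stmt
  = Block [Stmt]
  | CommentedStmt Comment Stmt   -- Stmt -> Comment Stmt
    -- Comment slots sit before "(", before "in", and before ")".
  | For [Comment] Expr [Comment] Range [Comment] Stmt
  deriving (Show, Eq)

comment :: Comment -> String
comment c = "/*" ++ c ++ "*/"

exprTokens :: Expr -> [String]
exprTokens (Var v)             = [v]
exprTokens (Num n)             = [show n]
exprTokens (CommentedExpr c e) = comment c : exprTokens e

rangeTokens :: Range -> [String]
rangeTokens (Range lo cs hi) =
  exprTokens lo ++ map comment cs ++ [":"] ++ exprTokens hi

stmtTokens :: Stmt -> [String]
stmtTokens (Block ss)          = ["{"] ++ concatMap stmtTokens ss ++ ["}"]
stmtTokens (CommentedStmt c s) = comment c : stmtTokens s
stmtTokens (For beforeParen var beforeIn range beforeClose body) =
  ["for"] ++ map comment beforeParen ++ ["("] ++ exprTokens var
    ++ map comment beforeIn ++ ["in"] ++ rangeTokens range
    ++ map comment beforeClose ++ [")"] ++ stmtTokens body

-- A naive single-line printer; a real formatter would lay out lines properly.
render :: Stmt -> String
render = unwords . stmtTokens

-- The loop from the question with comments A through J (empty body).
loop :: Stmt
loop =
  CommentedStmt "A" $
    For ["B"] (CommentedExpr "C" (Var "i"))
        ["E"] (Range (CommentedExpr "F" (Num 1)) ["G"] (CommentedExpr "H" (Num 10)))
        ["I"] (CommentedStmt "J" (Block []))
```

With these definitions, render loop gives "/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/ { }", so the comment positions from the question survive a round trip through the AST while the for node itself carries only three extra fields.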
