I'm planning to write a parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I'd like to include comments in the AST so that I can eventually implement a code formatter.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a (Expr a) (Expr a) | ...
and use the type parameter a for whatever annotation is needed (e.g. for comments that come after the expression).
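To make the annotated-AST idea concrete, here is a minimal sketch using only base. The Lit constructor, the Comments synonym, and the helper names annotation and stripAnnotations are hypothetical additions for illustration; only the shape of Add comes from the question.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hypothetical annotation type: the comments that follow a node.
type Comments = [String]

-- Every constructor carries an annotation of type a.
data Expr a
  = Lit a Int                  -- assumed extra constructor, for illustration
  | Add a (Expr a) (Expr a)
  deriving (Show, Eq, Functor)

-- Read the annotation stored on the root of an expression.
annotation :: Expr a -> a
annotation (Lit a _)   = a
annotation (Add a _ _) = a

-- Discard all annotations, e.g. before evaluating or comparing trees.
stripAnnotations :: Expr a -> Expr ()
stripAnnotations = fmap (const ())

-- 1 + 2, with a trailing comment attached to the sum.
example :: Expr Comments
example = Add ["sum of two literals"] (Lit [] 1) (Lit [] 2)
```

Deriving Functor is a design choice here: it lets passes that do not care about comments ignore them through fmap, while the formatter can still read them with annotation.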
However, there are some not-so-exciting cases. The language features C-like comments (// ... and /* ... */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body, there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 and 10) and a list of statements (the body), we would now have to include at least 10 more parameters or record fields for annotations.
This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person who wants to write a code formatter that preserves comments, so there must be a decent solution. Or is writing a formatter just that messy?
Answer:
You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G, depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.
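On the AST side, the productions above might translate into something like the following sketch. It is not a parser; it only shows the data types and a naive printer, using base alone. All names (ECommented, SCommented, renderStmt, and so on) are hypothetical, and it assumes each Comment slot in the for production holds zero or more block comments.

```haskell
type Comment = String

data Expr
  = Var String
  | Num Int
  | ECommented Comment Expr          -- Expr -> Comment Expr
  deriving (Show, Eq)

data Range = Range Expr Expr deriving (Show, Eq)

data Stmt
  -- "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
  = For [Comment] Expr [Comment] Range [Comment] Stmt
  | Block [Stmt]
  | SCommented Comment Stmt          -- Stmt -> Comment Stmt
  deriving (Show, Eq)

-- Print one block comment followed by a space.
renderComment :: Comment -> String
renderComment c = "/*" ++ c ++ "*/ "

renderExpr :: Expr -> String
renderExpr (Var v)          = v
renderExpr (Num n)          = show n
renderExpr (ECommented c e) = renderComment c ++ renderExpr e

renderRange :: Range -> String
renderRange (Range lo hi) = renderExpr lo ++ ":" ++ renderExpr hi

-- Naive single-line printer; a real formatter would handle layout.
renderStmt :: Stmt -> String
renderStmt (SCommented c s) = renderComment c ++ renderStmt s
renderStmt (Block ss) = "{ " ++ concatMap (\s -> renderStmt s ++ " ") ss ++ "}"
renderStmt (For beforeParen var beforeIn range beforeClose body) =
  "for " ++ concatMap renderComment beforeParen
    ++ "(" ++ renderExpr var ++ " "
    ++ concatMap renderComment beforeIn
    ++ "in " ++ renderRange range ++ " "
    ++ concatMap renderComment beforeClose
    ++ ") " ++ renderStmt body

-- Comments B, C, E, I and J from the question, placed in the tree.
example :: Stmt
example =
  For ["B"] (ECommented "C" (Var "i")) ["E"]
      (Range (Num 1) (Num 10)) ["I"]
      (SCommented "J" (Block []))
```

With these definitions, renderStmt example yields for /*B*/ (/*C*/ i /*E*/ in 1:10 /*I*/ ) /*J*/ { }, so the for node only needs three comment slots of its own, and the generic wrappers carry the rest.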