I'm currently beginning an automated software analysis project and am in the research phase. I'm quite new to parsing and struggling to find resources that compare the main Java parsing options. I understand JavaParser was created using JavaCC; what functionality does it offer that JavaCC does not? Are there any primary differences I should be aware of when deciding which parser to use? Similarly, are there features that the Eclipse JDT offers compared to these two which may be of use to me? Thank you in advance for any answers.
CodePudding user response:
This is by no means an exhaustive answer, just a bit of clarification on the specific parts of your question and my 5 cents on the more general one. I assume that you want to analyze Java code.
I also assume that it is a sort of exercise in using code-as-data and grammars/parsers. Otherwise, the field of code analysis itself is huge, with very specific niches like finding bugs or checking code for thread safety.
Since it's the bedrock of the whole field of software development ever since people stopped expressing algorithms as numbers encoded as holes punched in cards, there's a huge number of tools available for the purpose, but if we limit them to those written in Java, the biggest fish in the open source space seem to be covered here. For a more complete list see this blog from some of the authors of JavaParser and this for a general introduction to the topic. It may also be worth having a look at their material on the somewhat overlapping topic of language development in general.
In hindsight, these questions were lurking in the background of this response:
- Do you need to parse in the first place? E.g. getting word or line counts won't need full-blown parsing. A regex or a scanner (often the first stage in parsing) might do if you want to extract all string constants or identifiers.
- Is it an interactive (IDE) setting that needs lots of feedback, editing support and continuous incremental compilation in the background?
- Do you need to handle incomplete or (temporarily) broken code?
- Do you have to deal with stuff that goes beyond parsing, e.g. type checking?
- Is it only about analysis, or about transformations as well?
- What's the size of the code to handle within given time constraints? More generic tools won't give you the fastest possible processing.
- Do you need a compact standalone tool or can you live with a zoo of dependencies?
- How well is the structure of the output suited to the intended operations on it? All tools will give you an abstract syntax tree (AST) for a given piece of code, but each AST will be different (discussed below).
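To make the first question above concrete, here is a minimal sketch of pulling string constants out of source text with a regex and no parser at all. The class and method names are made up for illustration, and the pattern is deliberately naive: it assumes ordinary double-quoted literals and will be fooled by quotes inside comments, char literals like '"', or text blocks. That is roughly the point at which a real scanner or parser starts to pay off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any of the libraries discussed here.
public class StringLiteralScan {

    // A double-quoted literal: an opening quote, then any mix of
    // "ordinary" characters (no quote, backslash or line break)
    // and backslash escapes, then a closing quote.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"");

    // Returns every match of the pattern, including the surrounding quotes.
    static List<String> stringLiterals(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "String a = \"hello\"; String b = \"say \\\"hi\\\"\";";
        stringLiterals(code).forEach(System.out::println);
    }
}
```

If this kind of approximation is good enough for your analysis, you can skip the parsing libraries entirely; if the edge cases matter, you need at least a proper lexer.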
Let's go from the specific to the general:
com.github.javaparser parses a static piece of Java code (note: only Java, only static) and gives you an AST. The package also has SymbolResolver, which tries to determine the Java type of symbols. It's called JavaParser, but it isn't just a parser: it supports Java streams for querying and comes with AST manipulation and code generation capabilities. A main backer is an Italian company, by the way.
Eclipse JDT is huge by comparison, with org.eclipse.jdt.core.dom.ASTParser giving you an AST. But as opposed to JavaParser, everything is geared towards handling Java (only) in an interactive development situation. Since Eclipse can perform refactorings, it must be able to analyze and manipulate the AST; here's an example for that (as part of this post) and here are comprehensive examples for the refactoring API. If you're building some Eclipse-integrated functionality to support writing code, that will be your first option anyway. Eclipse JDT supports incremental compilation in some form, which you need if you want compile-on-the-fly-and-give-feedback-as-the-code-gets-typed functionality.
I also worked a bit with the Spoon library (developed by a university in France), which has the same focus as JavaParser and also does symbol resolution, but has different querying mechanisms. It builds on org.eclipse.jdt.core. Each of those tools will give you a different AST for the same Java code, reflecting their intended use case. Spoon describes it like this:
A programming language can have different meta models. An abstract syntax tree (AST) or model, is an instance of a meta model. Each meta model – and consequently each AST – is more or less appropriate depending on the task at hand. For instance, the Java meta model of Sun’s compiler (javac) has been designed and optimized for compilation to bytecode, while, the main purpose of the Java meta model of the Eclipse IDE (JDT) is to support different tasks of software development in an integrated manner (code completion, quick fix of compilation errors, debug, etc.).
Which framework suits your needs will depend very much on your use case. E.g. if you need symbol resolution, you're probably bound to those options that provide it anyway. I tried to get my feet wet with a Java transpiler and found the JavaParser metamodel more suitable than Spoon's model, and I liked its small number of dependencies.
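As a dependency-free illustration of "an AST plus a visitor", here is a small sketch that uses the javac tree API shipped with the JDK (the com.sun.source packages in the jdk.compiler module), i.e. the compiler-oriented meta model mentioned in the quote above. Note that this is not JavaParser, Spoon or JDT, whose ASTs and query APIs look different. It assumes you run on a full JDK (on a plain JRE ToolProvider.getSystemJavaCompiler() returns null), and the class name, the helper method and the "Snippet.java" file name are made up for the example.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; only the JDK's own compiler API is used.
public class JavacAstSketch {

    // In-memory source "file", so nothing has to exist on disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String code) {
            super(URI.create("string:///Snippet.java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Parses the source (no type checking) and collects the names of
    // all method declarations by walking the AST with a visitor.
    static List<String> methodNames(String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new StringSource(code)));
        List<String> names = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree node, Void unused) {
                        names.add(node.getName().toString());
                        return super.visitMethod(node, unused);
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(methodNames("class A { void foo() {} int bar() { return 1; } }"));
    }
}
```

The other libraries follow the same general pattern (parse, then visit or query node types), but with their own node classes, which is exactly why the choice of meta model matters for what you want to do with the tree.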
A general (though non-incremental) way to get a handle on an AST would be a parser generator like JavaCC (read: a compiler compiler (aka compiler generator) written in Java that can create parsers for anything you have a grammar for) or ANTLR. If you want to parse SQL, you feed them a SQL grammar; if you want to parse Java code, you feed them this one (ANTLR format) or this one (JavaCC format). The result will be a parser which can give you an AST for a given piece of code, and perhaps a visitor class.
Apart from generality, the argument I see for this approach is tighter control and perhaps the possibility to tweak the Java grammar depending on your needs, e.g. to introduce additional non-terminal nodes. Or maybe, if you want to get fancy, you could use it to parse embedded non-Java code fragments, e.g. SQL query strings.
By the way, ANTLR (since version 4) can handle direct left recursion in the grammar, while JavaCC can't, e.g. rules for arithmetic expressions with binary operators like exp := exp + exp.
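To show what that means in practice: a top-down (recursive descent) parser of the kind JavaCC generates would call itself forever on a left-recursive rule such as exp := exp '+' term, so the rule is usually rewritten as iteration, exp := term (('+' | '-') term)*. The following hand-written sketch is not JavaCC output, and its class and method names are invented for the example; it just implements the rewritten rule for integer sums and differences and keeps the left-to-right associativity that the left-recursive form expresses.

```java
// Hypothetical hand-written recursive descent evaluator.
public class LeftRecursionSketch {
    private final String input;
    private int pos = 0;

    LeftRecursionSketch(String input) {
        // Drop whitespace so that the parser only sees digits and operators.
        this.input = input.replaceAll("\\s+", "");
    }

    // exp := term (('+' | '-') term)*
    // The loop replaces the left-recursive "exp := exp '+' term".
    int exp() {
        int value = term();
        while (pos < input.length()
                && (input.charAt(pos) == '+' || input.charAt(pos) == '-')) {
            char op = input.charAt(pos++);
            int right = term();
            // Folding into "value" on each iteration gives left associativity.
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // term := DIGIT+
    int term() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    // Parses and evaluates the whole input, rejecting trailing garbage.
    static int evaluate(String text) {
        LeftRecursionSketch parser = new LeftRecursionSketch(text);
        int value = parser.exp();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + parser.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("10 - 3 - 2")); // (10 - 3) - 2 = 5
    }
}
```

With ANTLR 4 you can write the left-recursive rule directly and let the tool do this rewriting for you; with JavaCC you restructure the grammar yourself along these lines.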
If your goal is to support developer activities as they write the code, you'll have to deal with broken or incomplete code. Eclipse is built for that purpose, and while I haven't used its JDT, I'd expect it to handle such cases gracefully with reasonable feedback. ANTLR will also recover from syntax errors if possible and give you an AST for everything that is not broken, so to speak. I don't remember what Spoon and JavaParser did in case of errors; I think they expect syntactically correct code upfront.