Java 8 regex - Matching Lambdas in Source Files

Time:12-02

Problem

Note: this is not a duplicate of "Regular expression to stop at first match". I need to match ALL occurrences of this in a file, which is why I'm using re-seq.

I'd like to extract lambda parameters from the source code below using regex. Lambdas typically look like this: (param1, param2) -> body, but sometimes they have type info like this: (param1: String, param2: String): String -> body. It's valid to define a lambda over multiple lines like this:

(
x
)
->
1

Here's an example of a valid, nonsensical source file:

fun firstFunction(firstParam, secondParam) = do {
    var lambdaWithName = (x) -> x + 1
    ---
    "data"
}

fun complicatedFunction(param1: String, param2: Array<Object>): Any =
  ((x1: String, x2: String) -> "hi")("foo", "bar")

The lambda parameters in the above script are x, x1, x2. I don't necessarily need the regex to return individual parameters. I need at least a partial string of the lambda that includes the parameters (e.g. I could work with this: (x1: String, x2: String) ->)

What I've tried

I'm using Clojure, which represents all of its regexes via java.util.regex.Pattern:

(type #"\s") => java.util.regex.Pattern

I prototyped my regex on regex101.com using the Java 8 flavor and got good results with the following regex: \(\s*.+\s*\)\s*-> (see https://regex101.com/r/WeMq4a/1).

But when I use this regex in code it matches far too much:

(re-seq #"\(\s*.+\s*\)\s*->" src-code)

This matches everything from "(firstParam" in the first line to "->" on the last line.

I've also tried #"\(\s*.?\s*\)\s*->" and #"\(\s*.??\s*\)\s*->", but they only return a string for the first set of params, (x) ->, and not the second, (x1: String, x2: String) ->.

Answer:

What you want is simply impossible.

The 'regular' in 'regular expression' wasn't chosen at random, nor are they named after a Ms. Jane Regular.

No, the term refers to a certain class of grammars. Some grammars fit within the strict ruleset such that you can call them 'a regular grammar', and some don't.

Non-regular grammars CANNOT be parsed with regular expressions.

Most non-trivial source code grammars aren't regular. This one certainly isn't.
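The textbook example of a non-regular language is balanced parentheses, which source code like the sample above relies on. Recognizing it requires counting nesting depth, and java.util.regex has no recursion construct to do that for arbitrary depth. A minimal Java sketch of the counter a regex cannot express (the class and method names are made up for illustration):

```java
public class BalancedParens {
    // Returns true when every ')' closes an earlier '(' and nothing is left open.
    static boolean isBalanced(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // closed more than was opened
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("((x1, x2) -> f(x1))(1, 2)"));
        System.out.println(isBalanced("((x) -> x"));
    }
}
```

The counter is unbounded state, which is exactly what a regular grammar does not have.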

Some examples of tricky source code (and no amount of futzing with your regexp will EVER fix it):

var notALambda = "lambdaWithName = (x) -> x + 1";
var (x) /* comment */ ->   1;
var y = 5 /* (x) -> x + 1 */;
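To make this concrete, here is a small Java sketch. The pattern \(([^()]*)\)\s*-> is not the asker's regex. It is an assumed, commonly used heuristic: a parenthesized run with no nested parentheses, followed by an arrow. The input strings are adapted from the examples above. The heuristic finds the real lambda, but it also reports "lambdas" inside a string literal and inside a comment, and it misses the lambda with a comment between the parameter list and the arrow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveLambdaRegex {
    // Hypothetical heuristic: "(", anything but parentheses, ")", optional whitespace, "->".
    static final Pattern LAMBDA = Pattern.compile("\\(([^()]*)\\)\\s*->");

    // Collects the text between the parentheses of every match.
    static List<String> paramLists(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = LAMBDA.matcher(src);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // A real lambda: found.
        System.out.println(paramLists("((x1: String, x2: String) -> \"hi\")(\"foo\", \"bar\")"));
        // Inside a string literal: false positive.
        System.out.println(paramLists("var notALambda = \"lambdaWithName = (x) -> x\";"));
        // Inside a comment: false positive.
        System.out.println(paramLists("var y = 5 /* (x) -> x */;"));
        // Comment between ")" and "->": false negative.
        System.out.println(paramLists("var f = (x) /* comment */ -> 1;"));
    }
}
```

It would also miss the typed form (param1: String, param2: String): String -> body, because the return type sits between the closing parenthesis and the arrow.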

Ouch, so what do I do?

Find a parser for this language written in some parser engine such as ANTLR, grappa, etc, parse the entire source tree, then walk through the resulting Abstract Syntax Tree.
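As a rough illustration of what the lexing stage of a parser buys you, the Java sketch below blanks out string literals and block comments before looking for lambdas. It is not a parser for this language. It assumes double-quoted strings with backslash escapes and /* ... */ comments (guessed from the examples above), and it uses a hypothetical heuristic pattern, \(([^()]*)\)\s*->, which still cannot handle nested parentheses or return-type annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaskedLambdaScan {
    // Hypothetical heuristic: "(", anything but parentheses, ")", optional whitespace, "->".
    static final Pattern LAMBDA = Pattern.compile("\\(([^()]*)\\)\\s*->");

    // Replaces string literals and block comments with spaces, keeping the length unchanged.
    static String maskStringsAndComments(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"') {
                out.append(' '); // opening quote
                i++;
                while (i < n && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\' && i + 1 < n) {
                        out.append(' '); // backslash of an escape sequence
                        i++;
                    }
                    out.append(' ');
                    i++;
                }
                if (i < n) {
                    out.append(' '); // closing quote
                    i++;
                }
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                int end = src.indexOf("*/", i + 2);
                int stop = (end < 0) ? n : end + 2; // unterminated comment runs to the end
                for (; i < stop; i++) {
                    out.append(src.charAt(i) == '\n' ? '\n' : ' ');
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Applies the heuristic pattern to the masked source only.
    static List<String> paramLists(String src) {
        List<String> found = new ArrayList<>();
        Matcher m = LAMBDA.matcher(maskStringsAndComments(src));
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(paramLists("var notALambda = \"lambdaWithName = (x) -> x\";"));
        System.out.println(paramLists("var y = 5 /* (x) -> x */;"));
        System.out.println(paramLists("var f = (x) /* comment */ -> 1;"));
    }
}
```

Every new syntactic feature (line comments, other string styles, nested lambdas) needs another special case here, which is why walking a real AST is the robust route.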

You'll need a week or so to figure this all out, and that's if you're familiar with ASTs and parser engines. If not, you'll need more time.
