Problem
Note: this is not a duplicate of "Regular expression to stop at first match". I need to match ALL occurrences in a file, which is why I'm using re-seq.
I'd like to extract lambda parameters from the source code below using regex. Lambdas typically look like this: (param1, param2) -> body, but sometimes they have type info like this: (param1: String, param2: String): String -> body. It's valid to define a lambda over multiple lines like this:
(
x
)
->
1
Here's an example of a valid, nonsensical source file:
fun firstFunction(firstParam, secondParam) = do {
var lambdaWithName = (x) -> x 1
---
"data"
}
fun complicatedFunction(param1: String, param2: Array<Object>): Any =
((x1: String, x2: String) -> "hi")("foo", "bar")
The lambda parameters in the above script are x, x1, x2. I don't necessarily need the regex to return individual parameters. I need at least a partial string of the lambda that includes the parameters (e.g. I could work with this: (x1: String, x2: String) ->).
What I've tried
I'm using Clojure, which represents all of its regexes via java.util.regex.Pattern:
(type #"\s") => java.util.regex.Pattern
I was prototyping my regex with regex101.com using the Java 8 flavor and got good results with the following regex: \(\s*. \s*\)\s*->
See https://regex101.com/r/WeMq4a/1
But when I use this regex in code it matches far too much:
(re-seq #"\(\s*. \s*\)\s*->" src-code)
This matches everything from "(firstParam" in the first line to "->" on the last line.
I've also tried #"\(\s*.?\s*\)\s*->" and #"\(\s*.??\s*\)\s*->", but they only return a string for the first set of params, (x) ->, and not the second, (x1: String, x2: String) ->.
Answer
What you want is simply impossible.
The 'regular' in Regular Expression isn't chosen at random, nor were they invented by Ms. Jane Regular.
No, they refer to a certain class of grammars. Some grammars fit within the strict ruleset such that you can call them 'a regular grammar', and some don't.
Non-regular grammars CANNOT be parsed with regular expressions.
Most non-trivial source code grammars aren't regular. This one certainly isn't.
Some examples of tricky source code (and no amount of futzing with your regexp will EVER fix it):
var notALambda = "lambdaWithName = (x) -> x 1";
var (x) /* comment */ -> 1;
var y = 5 /* (x) -> x 1 */;
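To make those three cases concrete, here is a small sketch. It is written in Python rather than Clojure purely for illustration; the constructs used behave the same way in java.util.regex. The pattern LAMBDA_HEAD is a hypothetical stand-in (a parenthesised group without nested parentheses, followed by an arrow), not the pattern from the question.

```python
import re

# Hypothetical pattern: "(", anything except parentheses, ")",
# optional whitespace, then "->".
LAMBDA_HEAD = re.compile(r"\([^()]*\)\s*->")

tricky = [
    'var notALambda = "lambdaWithName = (x) -> x 1";',  # arrow inside a string literal
    'var (x) /* comment */ -> 1;',                      # comment between ")" and "->"
    'var y = 5 /* (x) -> x 1 */;',                      # arrow inside a comment
]

for line in tricky:
    # Prints the line followed by whatever the pattern claims to have found.
    print(repr(line), "=>", LAMBDA_HEAD.findall(line))
```

The first and third lines each produce a match even though the text sits inside a string or a comment, and the second line produces none because the comment separates the parenthesis from the arrow. Patching the pattern for one case tends to break another.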
Ouch, so what do I do?
Find a parser for this language written with a parser engine such as ANTLR or Grappa, parse the entire source file, then walk through the resulting Abstract Syntax Tree (AST).
You'll need a week or so to figure this all out, and that's if you're familiar with ASTs and parser engines. If not, you'll need more time.
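For a feel of what the first stage of such a parser does, below is a rough sketch of a hand-written tokenizer, again in Python for illustration only. It is not a substitute for a real grammar. The lexical rules are assumptions, not taken from the language's specification: double-quoted strings with backslash escapes, and /* ... */ or // comments. The names TOKEN and lambda_param_lists are hypothetical, and return-type annotations between ")" and "->" are deliberately not handled.

```python
import re

# Assumed lexical rules (hypothetical): double-quoted strings with
# backslash escapes; /* ... */ block comments; // line comments.
TOKEN = re.compile(
    r'''
    (?P<string>"(?:\\.|[^"\\])*")
  | (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<arrow>->)
  | (?P<punct>[()])
  | (?P<other>.)
    ''',
    re.VERBOSE | re.DOTALL,
)

def lambda_param_lists(src):
    """Return the text inside each '( ... )' that is directly followed by '->'."""
    # Keep only significant tokens: drop strings, comments and whitespace.
    tokens = [
        (m.lastgroup, m.group(), m.start(), m.end())
        for m in TOKEN.finditer(src)
        if m.lastgroup not in ("string", "comment") and not m.group().isspace()
    ]
    found = []
    open_stack = []  # indices into `tokens` of not-yet-closed "("
    for i, (kind, text, start, _end) in enumerate(tokens):
        if kind != "punct":
            continue
        if text == "(":
            open_stack.append(i)
        elif open_stack:  # text == ")" with a matching "("
            j = open_stack.pop()
            next_token = tokens[i + 1] if i + 1 < len(tokens) else None
            if next_token is not None and next_token[0] == "arrow":
                # Slice the original source between the matching parentheses.
                found.append(src[tokens[j][3]:start].strip())
    return found

example = '''fun firstFunction(firstParam, secondParam) = do {
var lambdaWithName = (x) -> x 1
---
"data"
}
fun complicatedFunction(param1: String, param2: Array<Object>): Any =
((x1: String, x2: String) -> "hi")("foo", "bar")
'''
print(lambda_param_lists(example))
```

Because strings and comments are removed before parentheses are paired up, the string and comment cases listed earlier no longer produce false matches. Everything else a real grammar knows about (type annotations after the parameter list, other bracket kinds, other literal forms) is still missing, which is why a proper parser remains the recommendation.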