Regex to extract S expression?-CodePudding

I'm wondering if it's possible to do a pass on parsing of a define expression in lisp with a single regular expression, for example with the following input:

#lang sicp
(define (square x) (* x x))
(define (average x y) (/ (  x y) 2))

; using block scope
(define (sqrt x)
   ; x is always the same -- "4" or whatever we pass to it, so we don't need that
   ; in every single function that we define, we can just inherit it from above.
   (define (improve guess) (average guess (/ x guess)))
   (define (good-enough? guess) (< (abs (- (square guess) x)) 0.001 ))
   (define (sqrt-iter guess) (if (good-enough? guess) guess (sqrt-iter (improve guess))))
   (sqrt-iter 1.0)
)

(sqrt 4)

I would want to highlight the three procedures below (none of the function-scoped procedures) that start with define. The process I was thinking (if I were to do it iteratively) would be:

Remove comments.
Grab the start of the define with \(\s*define
Consume balanced parentheses up until the unbalanced ) that finishes our procedure. For a regex, something like: (?:\([^)]*\))*, though I'm sure it gets much more complex with the greediness of the *'s.

And this wouldn't even be taking into account I could have a string "( define )" that we'd also want to ignore.

Would it be possible to build a regex for this, or too complicated? Here is my starting point, which is a long way from complete: https://regex101.com/r/MlPmOd/1.

CodePudding user response：

As a preface, there is a famous quote due to Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

The one word answer to your question is 'no'. Regular languages – the languages that regular expressions can recognise – are a proper subset of context-free languages, and the written form of s-expressions is context-free but not regular. So no regular expression can recognise the written form of an s-expression.

To see this consider a very tiny subset of s-expressions:

n = () | ( n)

So n consists of the set {(), (()), ((())), ...}, where the number of left parens and right parens in each string are equal. Such a language can't be recognised by a regular expression because you need to count parens.

Notes

Some instances of what are called 'regular expressions' in various programming languages are in fact more powerful than regular expressions and can therefore recognise classes of languages larger than regular languages. jwz's quote still applies: just because, perhaps, you can does not mean you should.
All programmers should in my opinion learn enough formal language theory to be dangerous. I don't know what a good modern reference is, but I learnt it from the Cinderella book: Hopcroft & Ullman, Introduction to Automata Theory, Languages, and Computation.
All Lisp programmers should in my opinion write a toy reader for s-expressions, as this is a good way of learning about how the real reader works, and doesn't take long.