Weird words replacement by using regexp in clojure

Time:04-25

I would like to wrap every whole word demo in "demo demo demo demo demo1 Demo" using the following code, but the result looks a bit weird.

(string/replace
 "demo demo demo demo demo1 Demo"
 (re-pattern (str "(?i)" "(^|\\s)(" "demo" ")($|\\s)"))
 "$1[[$2]]$3")

;; => "[[demo]] demo [[demo]] demo demo1 [[Demo]]"

Why are the second and fourth ones not replaced? I would appreciate any explanation or solution.

Edit: I did some experiments. If I add another space before the second word, it is replaced successfully, so it looks like the same whitespace character cannot serve as the boundary for two matches. I could run the replacement twice to catch all the demo words, but that is cumbersome. Is there a better solution?

(string/replace
 "demo  demo demo demo demo1 Demo"
 (re-pattern (str "(?i)" "(^|\\s)(" "demo" ")($|\\s)"))
 "$1[[$2]]$3")

;; => "[[demo]]  [[demo]] demo [[demo]] demo1 [[Demo]]"

CodePudding user response:

I would use \b (a word boundary) like this:

(clojure.string/replace "demo demo demo demo demo1 Demo"
                        #"\b(demo)\b"
                        "[[$1]]")

=> "[[demo]] [[demo]] [[demo]] [[demo]] demo1 Demo"

If you also want to match Demo:

(clojure.string/replace "demo demo demo demo demo1 Demo"
                        #"\b([dD]emo)\b"
                        "[[$1]]")

=> "[[demo]] [[demo]] [[demo]] [[demo]] demo1 [[Demo]]"
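
Since the question builds its pattern from a string, here is a minimal sketch of the same \b approach with the word passed in as an argument. The helper name bracket-word is made up for illustration. It assumes the word begins and ends with a word character (otherwise \b will not sit where you expect), and it uses java.util.regex.Pattern/quote so that any regex metacharacters in the word are treated literally.

```clojure
(require '[clojure.string :as string])
(import 'java.util.regex.Pattern)

(defn bracket-word
  "Hypothetical helper: wraps every whole-word, case-insensitive
  occurrence of `word` in `text` in [[ ]]."
  [text word]
  (string/replace text
                  ;; (?i) makes the match case-insensitive; Pattern/quote
                  ;; escapes the word so it is matched literally.
                  (re-pattern (str "(?i)\\b(" (Pattern/quote word) ")\\b"))
                  "[[$1]]"))

(bracket-word "demo demo demo demo demo1 Demo" "demo")
;; => "[[demo]] [[demo]] [[demo]] [[demo]] demo1 [[Demo]]"
```

The (?i) flag replaces the [dD] character class, so DEMO or dEmO would be wrapped as well, which may or may not be what you want.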

CodePudding user response:

Martin Půda's answer, using the \b zero-width word-boundary matching pattern, is probably the best for your needs.

If you want to understand why your pattern does not do what you want, the crux is that you are expecting overlapping matches. The Java Matcher class, which Clojure's regex functions are built on, finds non-overlapping matches: each search resumes where the previous match ended. Your pattern is "start of input or whitespace, followed by the string 'demo', followed by whitespace or end of input". The first match therefore consumes the space after the first 'demo', so the second 'demo' has no leading whitespace left to match and is skipped. That is also why your pattern worked when you put two spaces between the words.
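
To see the consumed whitespace directly, here is a small sketch that lists what the question's pattern actually matches. The var names pattern and matches are only for illustration.

```clojure
;; Same pattern as in the question.
(def pattern (re-pattern (str "(?i)" "(^|\\s)(" "demo" ")($|\\s)")))

;; re-seq returns the successive non-overlapping matches. Because the
;; pattern has groups, each element is a vector [whole g1 g2 g3];
;; `first` keeps only the whole match.
(def matches
  (mapv first (re-seq pattern "demo demo demo demo demo1 Demo")))

(prn matches)
;; prints ["demo " " demo " " Demo"]
```

The first match ends with the space that the second 'demo' would need in front of it, and the second match (around the third 'demo') likewise takes the space before the fourth.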

The general way to handle matches that would otherwise overlap is to use zero-width lookaheads and lookbehinds, which check the surrounding text without consuming it. Here is an example that solves your particular problem with only small changes to your pattern.

user> (clojure.string/replace
       "demo demo demo demo demo1 Demo"
       (re-pattern (str "(?i)" "(\\s?)(?<=^|\\s)(" "demo" ")(?=\\s|$)(\\s?)"))
       "$1[[$2]]$3")
"[[demo]] [[demo]] [[demo]] [[demo]] demo1 [[Demo]]"

To better show each of the matches, we can enclose each match in curly braces as follows:

user> (clojure.string/replace
       "demo demo demo demo demo1 Demo"
       (re-pattern (str "(?i)" "(\\s?)(?<=^|\\s)(" "demo" ")(?=\\s|$)(\\s?)"))
       "{$1[[$2]]$3}")
"{[[demo]] }{[[demo]] }{[[demo]] }{[[demo]] }demo1{ [[Demo]]}"
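
Because the lookarounds are zero-width, the optional \s? groups are not strictly needed. As a sketch of a slightly simpler variant (not the pattern used above), you can drop them and replace only the word itself. The second pair of calls shows where this whitespace-based check and \b differ; the var names are only for illustration.

```clojure
(require '[clojure.string :as string])

;; Lookbehind/lookahead only: the surrounding whitespace is checked
;; but never consumed, so adjacent words can all match.
(def by-whitespace
  (string/replace "demo demo demo demo demo1 Demo"
                  #"(?i)(?<=^|\s)(demo)(?=\s|$)"
                  "[[$1]]"))
;; => "[[demo]] [[demo]] [[demo]] [[demo]] demo1 [[Demo]]"

;; \b treats punctuation as a boundary, the lookarounds do not.
(def with-b
  (string/replace "demo, demo-x" #"\b(demo)\b" "[[$1]]"))
;; => "[[demo]], [[demo]]-x"

(def with-lookaround
  (string/replace "demo, demo-x" #"(?<=^|\s)(demo)(?=\s|$)" "[[$1]]"))
;; => "demo, demo-x"
```

Which of the two behaviours is right depends on whether "demo," and "demo-x" should count as occurrences of the word in your data.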