Match first double parenthesis instead of last

Time:10-31

I have been trying for a long time now to replace:

(a (b ((c) (d)) (e) :hello ((f (g) h)))))

by

(a (b ((c) (d)) (e)))

hello does not appear anywhere else in the string. I've tried a lot of different things, but thought it should work like this:

 sed -i 's/\s:hello\s.*))//g'

However, it does not seem to match the first two closing parentheses (the pair directly after h), i.e.

(a (b ((c) (d)) (e) :hello ((f (g) h ))))

but the last two (the pair at the very end of the string)

(a (b ((c) (d)) (e) :hello ((f (g) h)))))

and thus deletes everything after the :hello.

I also tried working with [^)]*, but could only get it to stop at a single parenthesis rather than two, and since there is a closing parenthesis after g, it stopped there.

CodePudding user response:

perl is better suited here, as it supports non-greedy matching. The command below matches up to the first occurrence of )) after hello:

$ s='(a (b ((c) (d)) (e) :hello ((f (g) h))))'
$ echo "$s" | perl -pe 's/\s:hello\s.*?\)\)//'
(a (b ((c) (d)) (e)))

# you can also recursively match balanced parentheses
$ cat ip.txt
(a (b ((c) (d)) (e) :hello ((f (g) h))))
(a (b ((c) (d)) (e) :hello (f (g) h)))
(a (b ((c) (d)) (e) :hello (f h)))
(a (b ((c) (d)) (e) :hello ((f ((c) (d)) h))))
$ perl -pe 's/\s:hello\s(\((?:[^()]++|(?1))++\))//' ip.txt
(a (b ((c) (d)) (e)))
(a (b ((c) (d)) (e)))
(a (b ((c) (d)) (e)))
(a (b ((c) (d)) (e)))

You can use some tricks to get it working with sed. In the solution below, all occurrences of )) are first replaced with a newline (since this character cannot be part of the input line in default usage). [^\n] can then be used to match up to the first occurrence only. After that, all newlines are changed back to )).

$ s='(a (b ((c) (d)) (e) :hello ((f (g) h))))'
$ echo "$s" | sed 's/))/\n/g; s/\s:hello\s[^\n]*\n//; s/\n/))/g'
(a (b ((c) (d)) (e)))

CodePudding user response:

.* means "skip as much as possible." Don't use that if you don't mean that.
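
A minimal sketch of what that looks like on the sample string (using the balanced version of the input, and [[:space:]] in place of \s so it does not depend on GNU sed; the variable names are just for illustration):

```shell
s='(a (b ((c) (d)) (e) :hello ((f (g) h))))'

# .* is greedy: it first runs to the end of the line, then backs off only
# as far as needed for the trailing "))" to match. That means the LAST "))"
# on the line is the one that gets consumed.
greedy=$(printf '%s\n' "$s" | sed 's/[[:space:]]:hello[[:space:]].*))//')
printf '%s\n' "$greedy"
# (a (b ((c) (d)) (e)
```

Everything from :hello to the end of the line is gone, including the two parentheses that should have stayed.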

As you already discovered, the regex for "not a right parenthesis" is [^)]. However, you want to permit one parenthesis as long as it's not immediately followed by another. This gets a bit ugly, as you need \(...\|...\) around the alternatives. (Switching to sed -r or sed -E would not really improve the situation much, as while you could then avoid the backslashes in this construct, you'd then have to backslash or otherwise escape the literal parentheses, outside of character classes.)

sed 's/\s:hello\s\([^)]\|)[^)]\)*))//g'

The -i option does not make sense here (if you actually have a file to process and you want to process it in-place, maybe put it back) and the \s is not portable (switch to [[:space:]] for the POSIX equivalent).
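
For reference, a sketch of a more portable spelling of the same idea. The \| alternation is itself a GNU extension in basic regular expressions, so this version assumes a sed that accepts -E (GNU and BSD sed both do) and uses [[:space:]] throughout; the variable names are only for illustration:

```shell
s='(a (b ((c) (d)) (e) :hello ((f (g) h))))'

# With -E, ( | ) are grouping/alternation and a literal ")" must be written
# as \) outside of bracket expressions. The group matches either a single
# non-")" character, or a ")" that is not followed by another ")".
portable=$(printf '%s\n' "$s" |
  sed -E 's/[[:space:]]:hello[[:space:]]([^)]|\)[^)])*\)\)//')
printf '%s\n' "$portable"
# (a (b ((c) (d)) (e)))
```

The /g flag is left out here because :hello only occurs once per line in the sample.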

As noted in another answer, more modern regex tools offer non-greedy quantifiers which skip as little as possible instead. It's still good to think about articulating a precise requirement; non-greedy matching is just one more tool for being precise. Too many beginners are confused and use it as a "do what I mean" hammer which of course it isn't at all.

CodePudding user response:

If your data is like your sample, then you can match everything from the colon to the last three parentheses after the final letter and replace it with nothing.

Using sed

sed 's/ :.*[a-z])))//' input_file
(a (b ((c) (d)) (e)))