How can I post-process a capture group while preserving others in sed?

Time:11-20

I have a string with 3 capture groups and would like to preserve the first and third but perform a substitution on the second. How do I express this in sed?

Concretely, I have an input string like:

top-level.subpath.one.subpath.two.subpath.forty-five

And I want to preserve the part before the first ., shorten the middle part to the first letter of every word, and preserve the part after the last .. The result should look like:

top-level.s.o.s.t.s.forty-five

For preserving the capture groups, I have:

sed -r 's/([^.]*)(.*)(\..*)/\1...\3/'

which gets me:

top-level....forty-five

For converting something like .subpath.one.subpath.two.subpath to only initials, I have:

sed -r 's/(\.[^.])[^\.]*/\1/g'

which gets me:

.s.o.s.t.s

I'd like to essentially apply that second sed expression to capture group 2. Is there some way I can chain sed substitutions to perform that second substitution on only the second capture group while retaining the first and third?

CodePudding user response:

You can use

sed -E ':a; s/^(.*\.[^.])[^.]+(\.)/\1\2/; ta' file > newfile         # GNU sed
sed -E -e :a -e 's/^(.*\.[^.])[^.]+(\.)/\1\2/' -e ta file > newfile  # FreeBSD sed

See the online demo. Details:

  • -E - enables POSIX ERE syntax (+ is now a one-or-more quantifier, and (...) is parsed as a grouping construct)
  • :a - sets a label named a
  • s/^(.*\.[^.])[^.]+(\.)/\1\2/ - matches zero or more chars, a . and then any single char other than a . (all captured into Group 1), then one or more chars other than a ., and then a dot captured into Group 2; the match is replaced with Group 1 followed by Group 2
  • ta - jumps back to the a label if the substitution succeeded, so the loop ends once nothing is left to shorten.
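To see what the loop described above does on each pass, here is a minimal sketch that runs the same substitution once per iteration from a plain shell loop and prints every intermediate string. The variable names s and prev are only for illustration; the sample string is the one from the question.

```shell
# Re-run the single substitution until the string stops changing,
# printing each intermediate result (this imitates the :a ... ta loop).
s='top-level.subpath.one.subpath.two.subpath.forty-five'
prev=
while [ "$s" != "$prev" ]; do
  printf '%s\n' "$s"
  prev=$s
  # One pass: shorten the rightmost middle segment that still has 2+ chars.
  s=$(printf '%s\n' "$s" | sed -E 's/^(.*\.[^.])[^.]+(\.)/\1\2/')
done
```

Because .* is greedy, each pass shortens the rightmost segment that is still longer than one character, so the middle part is abbreviated from right to left. The first segment is never touched because nothing before it is a dot, and the last one is safe because no dot follows it.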

CodePudding user response:

A simple awk solution that will work with any version of awk, including the one shipped with macOS:

s='top-level.subpath.one.subpath.two.subpath.forty-five'
awk 'BEGIN{FS=OFS="."} {for(i=2;i<NF;++i) $i=substr($i,1,1)}1' <<< "$s"

top-level.s.o.s.t.s.forty-five

This awk command uses . as both the input and the output field separator. We loop from field 2 through field NF-1 and replace each of those fields with its first character. The trailing 1 is an always-true pattern, so the modified record is printed.
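As a quick check of the edge cases, the sketch below wraps the same awk program in a shell function and feeds it a few short inputs. The function name abbreviate and the sample inputs are made up for illustration.

```shell
# Hypothetical helper: abbreviate every field except the first and last.
abbreviate() {
  awk 'BEGIN{FS=OFS="."} {for(i=2;i<NF;++i) $i=substr($i,1,1)}1'
}

# Inputs with one, two, and three dot-separated fields.
printf '%s\n' 'single' 'a.bcd' 'x.yz.w' | abbreviate
```

With fewer than three fields the loop body never runs, so such lines should pass through unchanged; only x.yz.w has a middle field to shorten.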


A BSD sed solution to do the same:

sed -E -e ':x' -e 's/(.+\..)[^.]+\./\1./; tx' <<< "$s"

top-level.s.o.s.t.s.forty-five

CodePudding user response:

This might work for you (GNU sed):

sed -E ':a;s/(\..*)\B.(.*\.)/\1\2/;ta' file

Capture the first and last periods and hollow out the middle removing any side-by-side word characters.


Ameliorating @anubhava's sed answer:

sed -E 's/(\..)[^.]+\./\1./g;s//\1./g' file

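Each match consumes the dot that ends it, so a single global pass can only shorten every other segment. Repeating the same substitution (the empty regex in s// reuses the previous one) picks up the segments the first pass skipped, which gives a two-command solution without a loop.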
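To make the two passes visible, here is a small sketch that applies the substitution in two separate sed calls and keeps the intermediate result. The variable names re, pass1 and pass2 are only for illustration, and the regex is spelled out both times instead of relying on s//.

```shell
s='top-level.subpath.one.subpath.two.subpath.forty-five'
re='s/(\..)[^.]+\./\1./g'   # same substitution as above, global flag set

# First pass: shortens every other middle segment.
pass1=$(printf '%s\n' "$s" | sed -E "$re")
# Second pass: shortens the segments the first pass had to skip.
pass2=$(printf '%s\n' "$pass1" | sed -E "$re")

printf '%s\n%s\n' "$pass1" "$pass2"
```

On the sample string the first pass should leave one and two intact (top-level.s.one.s.two.s.forty-five) and the second pass should finish the job.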
