sed regexp - extra unwanted line in matching output

Time:10-13

I have this file

~/ % cat t
---
    abc
def DEF    
ghi GHI
---
123
456

and I would like to extract the content between the two lines of three dashes, so I tried

sed -En '{N; /^---\s{5}\w+/,/^---/p}' t

That is: three dashes, followed by five whitespace characters (the newline plus four spaces of indentation), followed by one or more word characters, with the range ending at the next line of three dashes. This gives me the following output:

~/ % sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
--- 
    abc
def DEF
ghi GHI
---
123

I don't want the line with "123". Why am I getting that and how do I adjust my expression to get rid of it? [EDIT]: It is important that the four spaces of indentation after the first three dashes are matched in the expression.

CodePudding user response:

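There is no need to join lines in the pattern space with N here; a plain range address will do. Note that this range does not check the indentation of the line after the first three dashes.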

$ sed -n '/^---/,/^---/p' t 
---
    abc
def DEF    
ghi GHI
---

Tested in GNU sed 4.7 and OSX sed.
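
As for why the original command also printed 123: N appends the next input line to the pattern space on every cycle, so the file is processed two lines at a time, and the closing --- ends up sharing a pattern space with 123. The sketch below makes that pairing visible. The temporary file and the variable names are illustrative assumptions, and the sample data omits the trailing spaces of the original file.

```shell
# Recreate the sample file from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' '---' '    abc' 'def DEF' 'ghi GHI' '---' '123' '456' > "$sample"

# After N, each pattern space holds two input lines. Replace the embedded
# newline with " | " so every printed line shows one such pair.
pairs=$(sed -n 'N; s/\n/ | /; p' "$sample")
printf '%s\n' "$pairs"

rm -f "$sample"
```

The third pair is the closing --- together with 123, and since p prints the whole pattern space, both lines come out.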

CodePudding user response:

I believe you can use

perl -0777 -ne '/^---\R(\s{4}\w.*?^---)/gsm && print "$1\n";' t

Details:

  • -0777 - slurps the whole file into a single string
  • ^---\R(\s{4}\w.*?^---) - start of a line (^), ---, a line break (\R), then Group 1: four whitespace characters, a word character, then any characters, as few as possible, up to and including a --- at the start of a line
  • gsm - g is the global flag, s makes . match any character including line breaks, and m makes ^ match at the start of any line, not just at the start of the string
  • && print "$1\n" - if there is a match, print the Group 1 value followed by a line break.

CodePudding user response:

This might work for you (GNU sed):

sed -En '/^---/{:a;N;/^ {4}\S/M!D;/\n---/!ba;p;d}' file

Turn on extended regular expressions (-E) and turn off implicit printing (-n).

If a line begins with --- and the following line is indented by 4 spaces, gather up the following lines until another one begins with ---, then print them.

If the following line does not meet the above criteria, delete the first line and restart the cycle with what remains.

All other lines will pass through unprinted.

N.B. The M flag on the second regexp enables multiline matching, so ^ also matches right after an embedded newline. Since the first line already begins with ---, the line that matches must be the indented one.
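
To see the indentation check at work, the sketch below feeds the script a made-up input (an assumption, not the file from the question) in which the first --- is followed by an unindented line. It assumes GNU sed, since the M flag is a GNU extension; the variable names are illustrative.

```shell
input=$(mktemp)
# First block: the line after --- is not indented, so it should be skipped.
# Second block: the line after --- is indented by four spaces.
printf '%s\n' '---' 'xyz' '---' '    abc' 'def' '---' 'tail' > "$input"

# Same script as above, applied to the made-up input.
result=$(sed -En '/^---/{:a;N;/^ {4}\S/M!D;/\n---/!ba;p;d}' "$input")
printf '%s\n' "$result"

rm -f "$input"
```

Only the second block is printed, together with both of its --- lines. The D command drops the first --- once xyz fails the indentation test.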
