Regex select several lines until two consecutive new lines not working on Mac

Time:02-16

I need to extract several lines of text (the blocks vary in length throughout a 500 MB document) between a line that starts with Query # and two consecutive carriage returns. This is being done on a Mac. For example, the document format is:

Query #1: 020.1-Bni_its1_2019_envio1set1 

lines I need to extract


Alignments (the following lines I don't need)

xyz
xyx

Query #2: This and the following lines I need. And so on.

There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.

I tried the following regex but I only recover the first line.

ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)' 

I have tested multiple iterations of the regex on regex101, but have not yet found the answer.

Thanks in advance for any pointers.

CodePudding user response:

With pcregrep, you can use

pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt

Here,

  • o - outputs matched texts
  • M - enables matching across lines (puts line endings into "pattern space")
  • Query #.*(?:\R(?!\R{2}).*)* matches
    • Query # - literal text
    • .* - the rest of the line
    • (?:\R(?!\R{2}).*)* - zero or more repetitions of a line break sequence (\R) that is not immediately followed by two more line break sequences ((?!\R{2})), and then the rest of that line.

CodePudding user response:

From https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Using any awk in any shell on every Unix box:

$ awk '/^Query #/{f=1} /^Alignments/{f=0} f' file
Query #1: 020.1-Bni_its1_2019_envio1set1

lines I need to extract


Query #2: This and the following lines I need. And so on.

You don't show the expected output in your question, so I don't know for sure that the above is the output you want. If it's not, it'll be a trivial change to do whatever it is you do want.
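
For instance, if a block should end at the two consecutive blank lines themselves rather than at the word Alignments, something along these lines might do. It is a sketch under assumptions: the sample file and the function name extract_until_two_blanks are made up, a "blank line" is taken to mean a completely empty line, and a trailing carriage return is stripped from every line so that CRLF files behave the same way.

```shell
# Hypothetical sample file modelled on the format in the question.
awk_sample=$(mktemp)
printf '%s\n' 'Query #1: sample-a' '' 'wanted line' '' '' \
  'Alignments' '' 'xyz' '' 'Query #2: sample-b' > "$awk_sample"

extract_until_two_blanks() {
  awk '
    { sub(/\r$/, "") }                       # tolerate CRLF line endings
    /^Query #/ { inblock = 1; blanks = 0 }   # a new block starts here
    inblock {
      if ($0 == "") {                        # hold blank lines back for now
        if (++blanks == 2) inblock = 0       # two in a row end the block
        next
      }
      while (blanks > 0) { print ""; blanks-- }  # re-emit a held single blank
      print
    }
  ' "$1"
}

awk_result=$(extract_until_two_blanks "$awk_sample")
printf '%s\n' "$awk_result"
rm -f "$awk_sample"
```

Single blank lines inside a block are kept, while the two blank lines that end a block are dropped from the output.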
