Regex to select several lines until two consecutive new lines not working on Mac

I need to extract several lines of text (the blocks vary in length throughout the 500 MB document) between a line that starts with Query # and two consecutive carriage returns (that is, two blank lines). This is being done on a Mac. For example, the document format is:

Query #1: 020.1-Bni_its1_2019_envio1set1 

lines I need to extract


Alignments (the following lines I don't need)

xyz
xyx

Query #2: This and the following lines I need. And so on.

There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.

I tried the following regex but I only recover the first line.

ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)'

I have tested multiple iterations of the regex on regex101, but have not yet found the answer.

Thanks in advance for any pointers.

CodePudding user response：

With pcregrep, you can use

pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt

Here,

-o - outputs only the matched text
-M - enables matching across lines (puts line endings into "pattern space")
Query #.*(?:\R(?!\R{2}).*)* matches
- Query # - literal text
- .* - the rest of the line
- (?:\R(?!\R{2}).*)* - zero or more repetitions of a line break sequence (\R) that is not immediately followed by two more line break sequences ((?!\R{2})), and then the rest of that line.

CodePudding user response：

From https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Using any awk in any shell on every Unix box:

$ awk '/^Query #/{f=1} /^Alignments/{f=0} f' file
Query #1: 020.1-Bni_its1_2019_envio1set1

lines I need to extract


Query #2: This and the following lines I need. And so on.

You don't show the expected output in your question, so I don't know for sure that the above is the output you want, but if it's not, it'll be a trivial change to do whatever it is you do want.
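
As one example of such a change, the sketch below ends each block at the two consecutive blank lines instead of at the word Alignments, and drops those trailing blank lines from the output. It assumes a "blank" line may still contain spaces, tabs, or a stray carriage return; the function name print_query_blocks and the variables in_block and blanks are illustrative only.

```shell
#!/bin/sh
# Hypothetical variant of the awk answer: stop at two blank lines in a row.
print_query_blocks() {
  awk '
    /^Query #/ { in_block = 1; blanks = 0 }      # a new block starts here
    !in_block  { next }                           # skip everything outside a block
    /^[ \t\r]*$/ {                                # blank line inside a block
      if (++blanks == 2) { in_block = 0 }         # second blank in a row ends the block
      next                                        # hold blank lines back for now
    }
    {
      while (blanks > 0) { print ""; blanks-- }   # emit a held single blank line
      print                                       # then the current text line
    }
  ' "$@"
}

# Example: print_query_blocks file > results.txt
```

Because awk reads one line at a time, this approach does not need to hold the 500 MB file in memory.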