I need to extract several lines of text (which vary in length throughout the 500 MB document) between a line that starts with Query # and two consecutive carriage returns. This is being done on a Mac. For example, the document format is:
Query #1: 020.1-Bni_its1_2019_envio1set1
lines I need to extract
Alignments (the following lines I don't need)
xyz
xyx
Query #2: This and the following lines I need. And so on.
There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.
I tried the following regex but I only recover the first line.
ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)'
I have tested the regex with multiple iterations on regex101, but have not yet found the answer.
Thanks in advance for any pointers.
CodePudding user response:
With pcregrep, you can use
pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt
Here,
o - outputs matched texts
M - enables matching across lines (puts line endings into "pattern space")
Query #.*(?:\R(?!\R{2}).*)* matches
Query # - literal text
.* - the rest of the line
(?:\R(?!\R{2}).*)* - zero or more sequences of a line break sequence (\R) not immediately followed with two line break sequences ((?!\R{2})) and then the rest of the line.
CodePudding user response:
From https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Using any awk in any shell on every Unix box:
$ awk '/^Query #/{f=1} /^Alignments/{f=0} f' file
Query #1: 020.1-Bni_its1_2019_envio1set1
lines I need to extract
Query #2: This and the following lines I need. And so on.
You don't show the expected output in your question, so I don't know for sure that the above is the output you want, but if it's not, then it'll be a trivial change to do whatever it is you do want.
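As one possible variation, here is a sketch that stops at the first blank line instead of at the word Alignments, since the question describes the end of each block as consecutive line breaks. It assumes the wanted lines never contain a blank line themselves, and that a "blank" line is either empty or holds only spaces, tabs, or a stray carriage return. The sample file layout and the function name extract_queries are made up for illustration.

```shell
# Build a small sample file; the layout (two empty lines before
# "Alignments") is an assumption based on the question's description.
sample=$(mktemp)
printf '%s\n' \
  'Query #1: 020.1-Bni_its1_2019_envio1set1' \
  'lines I need to extract' \
  '' \
  '' \
  'Alignments (the following lines I do not need)' \
  'xyz' \
  'xyx' \
  '' \
  'Query #2: This and the following lines I need. And so on.' > "$sample"

# Hypothetical helper: set the flag at a "Query #" line, clear it at
# the first blank line (empty, or only spaces/tabs/carriage returns),
# and print every line seen while the flag is set.
extract_queries() {
  awk '/^Query #/{f=1} /^[ \t\r]*$/{f=0} f' "$1"
}

result=$(extract_queries "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Like the original command, this reads the file line by line, so the 500 MB size should not matter for memory use. If the blocks you want can contain blank lines, keying on /^Alignments/ as above is the safer choice.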