I have several large files in which I need to find a specific string and extract everything from the line containing that string up to the next line that begins with a date. The file looks like this:
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
The output I need is this:
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
Now I'm using
sed '/'"$string"'/,/'"$date"'/!d'
which works as intended, except that it also takes the next row with the date even if it doesn't contain the string, but that's not a big problem.
The problem is that searching the files takes a really long time.
Is it possible to edit the sed command so it runs faster, or is there another option with a better runtime? Maybe using awk or grep?
EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after one match is not suitable. I am looping through multiple files in a for loop with the same $string and the same $date. There are a lot of factors slowing the script down that I can't change (extracting files one by one from a 7z, then searching and removing them after the search, all in one loop).
CodePudding user response:
Using sed, you might use:
sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file
Explanation
- -n : Prevent default printing of a line
- /this_i_need/ : When matching this_i_need
- :a : Set a label a to be able to jump back to
- N : Pull the next line into the pattern space
- /\n20220520/! : If not matching a newline followed by the date
- ba : Jump back to the label (like a loop, and process what is after the label again)
- p : When we do match a newline and the date, print the pattern space
- q : Exit sed
Output
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
CodePudding user response:
With sed it has to delete all the lines outside the matching ranges from the buffer, which is inefficient when the file is large.
You can instead use awk to output the desired lines directly by setting a flag upon matching the specific string and clearing the flag when matching a date pattern, and outputting the line when the flag is set:
awk '/^[0-9]{8}/{f=0}/this_i_need/{f=1}f' file
Demo: https://ideone.com/J2ISVD
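As a minimal sketch of the flag approach on the question's sample data: the file name sample.log and the variable names are made up for illustration, and the date test is anchored to the start of the line and spelled out digit by digit so that it should also work in awk implementations without {n} interval support.

```shell
# Recreate the sample from the question in a scratch file (hypothetical name).
cat > sample.log <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

# Assumption: a line starting with 8 digits and a dash is a dated line.
date_re='^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-'

# f is cleared on every dated line, set again if that line holds the string,
# and the bare pattern "f" prints the line while the flag is set.
result=$(awk -v d="$date_re" '$0 ~ d {f=0} /this_i_need/ {f=1} f' sample.log)
printf '%s\n' "$result"
```

awk also accepts several file names in one call (awk '...' a.log b.log), which avoids starting one process per file; in that case adding a rule such as FNR==1{f=0} would keep a block from carrying over into the next file.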
CodePudding user response:
You might use the exit statement to instruct GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end well before the end of the file. Let file.txt content be
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
then
awk 's&&/^[[:digit:]]{8}.*this_i_need/{print;exit}/this_i_need/{p=1;s=1;next}p&&/^[[:digit:]]{8}/{p=0}p{print}' file.txt
gives output
what
to
do
20220520-11:53:01.257: bla this_i_need bla
Explanation: I use 2 flag variables, p as printing and s as seen. I inform GNU AWK to
- print the current line and exit if seen and the line starts with 8 digits followed by 0 or more characters followed by this_i_need
- set the p flag to 1 (true) and the s flag to 1 (true) and go to the next line if this_i_need was found in the line
- set the p flag to 0 (false) if the p flag is 1 and the line starts with 8 digits
- print the current line if the p flag is set to 1
Note that order of actions is crucial.
Disclaimer: this solution assumes that a line starting with 8 digits is a line beginning with a date; if this is not the case, adjust the regular expression to your needs.
(tested in gawk 4.2.1)
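A variant sketch of the early-exit idea, not the exact program above: it keeps a single flag (named seen here, an arbitrary choice), prints the matching line itself, and stops reading as soon as the first block is over. The digit-by-digit date pattern is an assumption chosen for portability across awk implementations.

```shell
# Recreate file.txt with the sample content.
cat > file.txt <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

# Assumption: a line starting with 8 digits and a dash is a dated line.
date_re='^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-'

# Rule order matters: the exit test runs before the flag is set.
first_block=$(awk -v d="$date_re" '
  seen && $0 ~ d && !/this_i_need/ { exit }    # first block is over: stop reading
  /this_i_need/                    { seen=1 }  # start (or continue) printing
  seen                                         # print while inside the block
' file.txt)
printf '%s\n' "$first_block"
```

Because it exits after the first block, this only helps when one occurrence per file is enough; for files with several occurrences the exit rule would have to be dropped.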
CodePudding user response:
Assumptions:
- start printing when we find the desired string
- stop printing when we read a line that starts with any date (ie, any 8-digit string)
One awk idea:
string='this_i_need'
awk -v ptn="${string}" ' # pass bash variable "$string" in as awk variable "ptn"
/^[0-9]{8}/ { printme=0 } # clear printme flag if line starts with 8-digit string
$0 ~ ptn { printme=1 } # set printme flag if we find "ptn" in the current line
printme # only print current line if printme==1
' foo.dat
Or as a one-liner sans comments:
awk -v ptn="${string}" '/^[0-9]{8}/ {printme=0} $0~ptn {printme=1} printme' foo.dat
NOTE: OP can rename the awk variables (ptn, printme) as desired as long as they are not a reserved keyword (see 'Keyword' in awk glossary)
This generates:
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
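If the search string should be taken literally rather than as a regular expression, a possible variation is to test it with awk's index() function. The sketch below also sets LC_ALL=C, which often speeds up text tools on large files, assuming the input is ASCII-compatible. The variable names and the digit-by-digit date pattern are illustrative assumptions, and note that -v still expands backslash escapes in the value.

```shell
# Recreate foo.dat with the sample content from the question.
cat > foo.dat <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

string='this_i_need'
# Assumption: a line starting with 8 digits and a dash is a dated line.
date_re='^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-'

# LC_ALL=C asks for byte-wise matching instead of locale-aware matching.
result=$(LC_ALL=C awk -v ptn="$string" -v d="$date_re" '
  $0 ~ d         { printme=0 }  # dated line: stop printing
  index($0, ptn) { printme=1 }  # literal substring test, no regex involved
  printme                       # print while the flag is set
' foo.dat)
printf '%s\n' "$result"
```

Whether this is measurably faster depends on the awk implementation and the data, so it is worth timing on a real file before relying on it.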