Speed up searching a large file using sed or an alternative

Time:05-25

I have several large files in which I need to find a specific string and extract everything from the line containing that string up to the next line that begins with a date. A file looks like this:

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

The output I need is this:

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Currently I'm using sed '/'"$string"'/,/'"$date"'/!d', which works as intended except that it also outputs the next line starting with the date even if that line doesn't contain the string; that's not a big problem.

The problem is that it takes a really long time searching the files. Is it possible to edit the sed command so it will run faster or is there any other option to get a better runtime? Maybe using awk or grep?

EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after the first match is not suitable. I am looping through multiple files in a for loop with the same $string and the same $date. There are a lot of factors slowing the script down that I can't change (extracting the files one by one from a 7z archive, searching them, and removing them after the search, all in one loop).

CodePudding user response:

Using sed you might use:

sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file

Explanation

  • -n Prevent default printing of a line
  • /this_i_need/ When matching this_i_need
  • :a Set a label a to be able to jump back to
  • N Pull the next line into the pattern space
  • /\n20220520/! If not matching a newline followed by the date
  • ba Jump back to the label (like a loop and process what is after the label again)
  • p When we do match a newline and the date, then print the pattern space
  • q Exit sed
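Because of q, the command above stops after the first block. If every occurrence is needed, as the EDIT in the question says, one possible variant is sketched below. It is not part of the original answer: it prints the matching line, keeps printing until a line starts with 8 digits, and then re-checks that line for the string instead of quitting. The file name sample.log, the labels s and a, and the "8 leading digits means a date" test are assumptions made for illustration.

```shell
# Build a small sample file (hypothetical name) with the layout from the question.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

# :s  re-entry point, so the date line that ends a block is itself re-checked
# p   print the line that contains the string
# n   load the next line (nothing is auto-printed because of -n)
# Lines that do not start with 8 digits are printed and the loop returns to :a.
# A line that starts with 8 digits ends the block and is tested again from :s.
out=$(sed -n '
:s
/this_i_need/{
  p
  :a
  n
  /^[0-9]\{8\}/!{
    p
    ba
  }
  bs
}' "$tmpdir/sample.log")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

On the sample this prints the first matching line with its three continuation lines, followed by the second matching line, and skips the intermediate date line that does not contain the string.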

Output

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla

CodePudding user response:

With your sed command, every line outside the matching ranges has to be deleted from the pattern space, which is inefficient when the file is large.

You can instead use awk to output the desired lines directly by setting a flag upon matching the specific string and clearing the flag when matching a date pattern, and outputting the line when the flag is set:

awk '/[0-9]{8}/{f=0}/this_i_need/{f=1}f' file

Demo: https://ideone.com/J2ISVD

CodePudding user response:

You might use the exit statement to tell GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end well before the end of the file. Let the content of file.txt be

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

then

awk 's&&/^[[:digit:]]{8}.*this_i_need/{print;exit}/this_i_need/{p=1;s=1;next}p&&/^[[:digit:]]{8}/{p=0}p{print}' file.txt

gives output

what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Explanation: I use two flag variables, p for printing and s for seen. I instruct GNU AWK to

  • print the current line and exit if s is set and the line starts with 8 digits followed by zero or more characters followed by this_i_need
  • set the p flag to 1 (true) and the s flag to 1 (true) and go to the next line if this_i_need was found in the line
  • set the p flag to 0 (false) if the p flag is 1 and the line starts with 8 digits
  • print the current line if the p flag is set to 1

Note that the order of the actions is crucial.

Disclaimer: this solution assumes that a line starting with 8 digits is a line beginning with a date; if that is not the case, adjust the regular expression to your needs.

(tested in gawk 4.2.1)
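If 8 leading digits are too loose a test, the date pattern can be tightened to the full timestamp layout seen in the sample (YYYYMMDD-HH:MM:SS.mmm:). The sketch below illustrates that adjustment under this assumption. It was not tested with the gawk version mentioned above, and it uses a simple print flag rather than the exit logic of the command above. The names d, d2, d3, d8, ts_re, re and f, and the file sample.log, are hypothetical. The digit classes are spelled out instead of using {8} because some awk implementations (older mawk releases, for example) do not support interval expressions.

```shell
# Sample data (hypothetical): one continuation line happens to start with 8 digits.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
20220520-11:53:01.244: foo_bar blah: this_i_need
12345678 is a continuation line that merely starts with 8 digits
do
20220520-11:53:01.257: blablabla
EOF

# Assemble the timestamp regex from repeated digit classes (no {n} intervals).
d='[0-9]'
d2="${d}${d}"
d3="${d2}${d}"
d8="${d2}${d2}${d2}${d2}"
ts_re="^${d8}-${d2}:${d2}:${d2}[.]${d3}:"

out=$(awk -v re="$ts_re" '
$0 ~ re       { f = 0 }   # a full timestamp at line start ends the current block
/this_i_need/ { f = 1 }   # the search string starts a block
f                         # print the line while inside a block
' "$tmpdir/sample.log")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

With the looser "8 digits" test the second sample line would end the block early; with the full timestamp it is kept as a continuation line.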

CodePudding user response:

Assumptions:

  • start printing when we find the desired string
  • stop printing when we read a line that starts with any date (i.e., any 8-digit string)

One awk idea:

string='this_i_need'

awk -v ptn="${string}" '         # pass bash variable "$string" in as awk variable "ptn"
/^[0-9]{8}/ { printme=0 }        # clear printme flag if line starts with 8-digit string
$0 ~ ptn    { printme=1 }        # set printme flag if we find "ptn" in the current line
printme                          # only print current line if printme==1
' foo.dat

Or as a one-liner sans comments:

awk -v ptn="${string}" '/^[0-9]{8}/ {printme=0} $0~ptn {printme=1} printme' foo.dat

NOTE: OP can rename the awk variables (ptn, printme) as desired, as long as they are not reserved keywords (see 'Keyword' in the awk glossary)

This generates:

20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
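
One caveat: $0 ~ ptn treats the search string as a regular expression, so characters such as . or [ or * in $string are interpreted as regex syntax. If the string should be matched literally, index() is one alternative. A plain substring test may also be cheaper than a regex match, but that depends on the awk implementation and is worth measuring on the real files. A minimal sketch, in which the sample content, the search string a.c and the variable name s are made up for illustration (the digit classes are spelled out rather than written as {8} for portability across awk implementations):

```shell
# Hypothetical sample: the first line contains "abc", the second contains "a.c".
tmpdir=$(mktemp -d)
cat > "$tmpdir/foo.dat" <<'EOF'
20220520-11:53:01.242: abc
20220520-11:53:01.244: foo a.c
more
20220520-11:53:01.257: bar
EOF

string='a.c'    # as a regex this would also match "abc" on the first line

out=$(awk -v s="$string" '
/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ { printme = 0 }   # line starts with 8 digits: stop printing
index($0, s)                                 { printme = 1 }   # literal substring found: start printing
printme                                                        # print while the flag is set
' "$tmpdir/foo.dat")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Here only the block that starts at the line containing the literal text a.c is printed; the abc line is not treated as a match.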