Reducing Time complexity of "Sed" command in bash script


I have a script for daily monitoring of my system that works by reading a log file. The command I use to read and parse the relevant line from the log file with sed is as follows:

lastline=$(cat "logs1/$file" | sed '/.\{100\}/!d' | sed -n '$p')

Although this command works correctly, it takes a long time to execute, and I need to reduce its execution time. I am not able to reduce the size of the file. Can you suggest a better solution or an alternative to this command?

The log file has 2-3 million lines, and its data looks like this:

21/11/02 10:05:53.906 | OUT   | OUT | [.0772230000340600720676E00000003406              100210055390                                  121676570608000000NOH1N1AFRN00AFRN136220211102100553254IRT1AFRN000100676            20211102000000029700000003581320000001463900070  1    1      120211102100553                                        H110B                0300000000                    184     202111020000000041        184980011  1849800118480208316         0000000000000000001               184-IR98001 080210     20211102085506 LJA1TSEDRHAHUB220000001463900 0000000000000                                                                                                                                                                                                                    0000000000000000000000000.]
21/11/02 10:05:55.607 | OUT   | IN  | [.000899.]
21/11/02 10:06:00.711 | OUT   | IN  | [.000899.]
21/11/02 10:06:05.714 | OUT   | IN  | [.000899.]
21/11/02 10:06:06.014 | OUT   | OUT | [.0772230000340700720676E00000003407              100210060601                                  121676574028000000NOH1N1SARV00SARV136220211102100605261IRT1SARV000100676            20211102000000100400000000992620000007140000070  1    1      120211102100605                                        H110B                0300000000                                                                                                                                                                                                                           120     202111020000002132        120980011  1209800112080208316         0000000000000000001               120-IR98001            20211102100448 LJA1TSEDRHFHUB220000007140000 0000000000000             0000000000000000000000000.]

Some lines (like lines 2, 3 and 4 above) contain incomplete data, so we have to look for the last line with complete data. There is no fixed rule I can use to determine how far back the last complete line may be; there may be no complete line in the last 1000 lines, in which case a fixed-size window would not return the correct output. (This is why tail alone does not work.)

P.S. Part of the code can be seen at this link: here

CodePudding user response:

With sed in one single invocation:

lastline=$(sed -n '/^.\{100\}/h;${g;p}' "logs1/$file")

Each line with at least 100 characters is copied to the hold space. At the end of the log file, we copy the hold space back to the pattern space and print it.

If this is not fast enough, you'll probably need to use something other than sed.

CodePudding user response:

Try this:

lastline=$(awk '(length>=100) {last=$0}; END {print last}' "logs1/$file")

Explanation: awk can do all of this itself, only looking at each line once. It just records the latest line of 100 or more characters in the last variable, and prints it at the end. It also reads directly from the file, avoiding the overhead of cat.

I don't know for certain if this'll be faster or how much so; it may depend on what version of awk you happen to have. But in principle it should be faster since it does less work on each line as it goes through the file.

If you really want it to be fast, I think you'd need to write something like a C program that seeks to a point a short way before the end of the file (maybe a couple of thousand bytes) and looks for a long line in just that last part of the file. If it doesn't find one, it seeks back a bit further and tries again.
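
A rough shell approximation of that idea, for illustration only: start with a small window at the end of the file and grow it until a sufficiently long line turns up. This sketch assumes a "complete" line is simply one with at least 100 characters, as in the question; the 4 KiB starting window and the doubling step are arbitrary choices.

find_last_complete() {
    # Return the last line of at least 100 characters, reading as little
    # of the file as possible (only its tail, in growing windows).
    local file=$1 bytes=4096 size candidate
    size=$(wc -c < "$file" | tr -d ' ')
    while :; do
        if [ "$bytes" -ge "$size" ]; then
            # The window now covers the whole file: scan it once and stop.
            grep '.\{100\}' "$file" | tail -n 1
            return
        fi
        # Read only the last $bytes bytes and drop the first, possibly
        # truncated, line of the window.
        candidate=$(tail -c "$bytes" "$file" | sed '1d' | grep '.\{100\}' | tail -n 1)
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return
        fi
        bytes=$((bytes * 2))   # nothing long enough yet: look further back
    done
}

lastline=$(find_last_complete "logs1/$file")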

CodePudding user response:

Solution: print the last line containing at least 100 characters:

grep '.\{100\}' "logs1/$file" | tail -n 1

It can also be done with a single sed:

sed -ne '/.\{100\}/h' -e '${x;p}'

But grep will usually be faster than sed, especially GNU grep. It really depends on the grep implementation, though.

These rough benchmarks can illustrate the point:

GNU:

$ time grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null

real    0m0.278s
user    0m0.345s
sys 0m0.000s

$ time sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null

real    0m10.590s
user    0m10.580s
sys 0m0.000s

GNU grep, piped to tail -n 1, is roughly 40 times faster than GNU sed here.

Busybox:

$ time busybox grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null

real    0m10.340s
user    0m10.413s
sys 0m0.000s

$ time busybox sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null

real    0m10.588s
user    0m10.583s
sys 0m0.000s

On Busybox, which has a simpler grep implementation, grep still wins, but the difference is marginal.

The test file was 20,000 lines of random printable ASCII characters (including spaces), of which 7058 lines have at least 100 characters:

$ wc -l /tmp/rand-lines
20000 /tmp/rand-lines
$ grep -c '.\{100\}' /tmp/rand-lines
7058
$ head -n 1 /tmp/rand-lines
zJ_u)k_# K!-ZjR#x2{?>Xw3%xOx|):L^SV|=z&fEUJgn;oO9@[Wq[8I^UniwZ0q&CpL,n7]NI^WK7ke{t).=LFHXyI'Z$Dn!g ^ _,Hq<3X*f=>fm8=qYyh!WQUMo_,GLDPPy*N^.(G0!$; O9WcsSY
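
For reference, a file with a similar shape can be generated with awk. This is only a sketch of one possible generator (the line-length range and character set here are arbitrary choices), not the exact command used to produce the file benchmarked above:

awk 'BEGIN {
    srand()
    for (i = 0; i < 20000; i++) {
        n = int(rand() * 150)                                  # random length, 0-149
        line = ""
        for (j = 0; j < n; j++)
            line = line sprintf("%c", 32 + int(rand() * 95))   # printable ASCII
        print line
    }
}' > /tmp/rand-lines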

CodePudding user response:

"we should look for the last line with complete data"

Write a program that opens the file, seeks to its end, reads lines from the back of the file (by searching backwards for newlines), checks whether each line is "complete", and, as soon as it finds a complete one, prints that line and terminates.

sed can't read the file from the end. Gluing commands together in a pipeline means the command on the left reads more of the file than necessary and pushes that data into the pipe, which causes a lot of unnecessary I/O.
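
If writing a dedicated program is not an option, GNU tac gives a shell-level approximation of the same idea: it emits the file's lines starting from the last one, and grep -m 1 exits at the first match, i.e. the last sufficiently long line, so in practice only the tail of the file needs to be read. A sketch, assuming GNU coreutils and the 100-character test from the question:

lastline=$(tac "logs1/$file" | grep -m 1 '.\{100\}')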
