I have a script for daily monitoring of my system that works based on reading from a log file.
The command I use to read and parse the string in the log file with sed is as follows:
lastline=`cat logs1/$file | sed '/.\{100\}/!d' | sed -n '$p'`
Although this command works correctly, it takes a long time to execute and I need to make it faster. Reducing the size of the file is not an option. Can you suggest a better solution or an alternative to this command?
The log file has 2-3 million lines, and its data looks like this:
21/11/02 10:05:53.906 | OUT | OUT | [.0772230000340600720676E00000003406 100210055390 121676570608000000NOH1N1AFRN00AFRN136220211102100553254IRT1AFRN000100676 20211102000000029700000003581320000001463900070 1 1 120211102100553 H110B 0300000000 184 202111020000000041 184980011 1849800118480208316 0000000000000000001 184-IR98001 080210 20211102085506 LJA1TSEDRHAHUB220000001463900 0000000000000 0000000000000000000000000.]
21/11/02 10:05:55.607 | OUT | IN | [.000899.]
21/11/02 10:06:00.711 | OUT | IN | [.000899.]
21/11/02 10:06:05.714 | OUT | IN | [.000899.]
21/11/02 10:06:06.014 | OUT | OUT | [.0772230000340700720676E00000003407 100210060601 121676574028000000NOH1N1SARV00SARV136220211102100605261IRT1SARV000100676 20211102000000100400000000992620000007140000070 1 1 120211102100605 H110B 0300000000 120 202111020000002132 120980011 1209800112080208316 0000000000000000001 120-IR98001 20211102100448 LJA1TSEDRHFHUB220000007140000 0000000000000 0000000000000000000000000.]
Some lines (such as lines 2, 3 and 4 in the sample above) contain incomplete data, so I have to look for the last line with complete data. There is no fixed rule I can use to determine how many lines back the last complete line is; there may be no complete data in the last 1000 lines, in which case a fixed-length tail would not return the correct output. (This is why tail alone does not work.)
P.S. Part of the code can be seen in this link: here
CodePudding user response:
With sed, in a single invocation:
lastline=$(sed -n '/^.\{100\}/h;${g;p}' "logs1/$file")
Each line with at least 100 characters is copied to the hold space. At the end of the log file we copy the hold space to the pattern space and we print the pattern space.
If this is not fast enough, you'll probably need to use something other than sed.
CodePudding user response:
Try this:
lastline=$(awk '(length>=100) {last=$0}; END {print last}' "logs1/$file")
Explanation: awk can do all of this itself, looking at each line only once. It simply records the latest line of 100 or more characters in the last variable, and prints it at the end. It also reads directly from the file, avoiding the overhead of cat.
I don't know for certain whether this will be faster or by how much; it may depend on which version of awk you happen to have. But in principle it should be faster, since it does less work on each line as it goes through the file.
If you really want it to be fast, I think you'd need to write something like a C program that seeks a ways before the end of file -- maybe a couple of thousand bytes -- and looks for a long line in just that last part of the file. If it doesn't find one, seek back a ways further, and try again.
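As a rough sketch of that idea in plain shell (using tail -c as a stand-in for the seek), something along these lines should work; the logs1/$file path is taken from the question, and the 4096-byte starting window and the log variable name are arbitrary choices:
log="logs1/$file"              # path taken from the question
size=$(wc -c < "$log")
n=4096                         # starting window; arbitrary
lastline=
while [ -z "$lastline" ]; do
    if [ "$n" -ge "$size" ]; then
        # Window now covers the whole file: fall back to a full scan and stop.
        lastline=$(grep '.\{100\}' "$log" | tail -n 1)
        break
    fi
    # Look only at the last n bytes: drop the first line of the chunk
    # (usually cut mid-line), then keep the last line of 100+ characters.
    lastline=$(tail -c "$n" "$log" | sed '1d' | grep '.\{100\}' | tail -n 1)
    n=$((n * 2))               # nothing found: widen the window and retry
done
printf '%s\n' "$lastline"
Since the complete lines normally appear every few entries, the first 4 KB window should almost always be enough, and the loop only touches the end of the file instead of all 2-3 million lines.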
CodePudding user response:
Solution:
Print the last line containing at least 100 characters:
lastline=$(grep '.\{100\}' "logs1/$file" | tail -n 1)
It can also be done with a single sed:
sed -ne '/.\{100\}/h' -e '${x;p}' "logs1/$file"
But the grep will usually be faster than the sed, especially if using GNU grep. It really depends on the grep implementation, though.
These rough benchmarks can illustrate the point:
GNU:
$ time grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null
real 0m0.278s
user 0m0.345s
sys 0m0.000s
$ time sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null
real 0m10.590s
user 0m10.580s
sys 0m0.000s
GNU grep, piped to tail -n 1, is 50x faster than GNU sed.
Busybox:
$ time busybox grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null
real 0m10.340s
user 0m10.413s
sys 0m0.000s
$ time busybox sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null
real 0m10.588s
user 0m10.583s
sys 0m0.000s
On Busybox, which has a simpler grep implementation, grep still wins, but the difference is marginal.
The test file was 20,000 lines of random printable ASCII characters (including spaces), of which 7058 lines have at least 100 characters:
$ wc -l /tmp/rand-lines
20000 /tmp/rand-lines
$ grep -c '.\{100\}' /tmp/rand-lines
7058
$ head -n 1 /tmp/rand-lines
zJ_u)k_# K!-ZjR#x2{?>Xw3%xOx|):L^SV|=z&fEUJgn;oO9@[Wq[8I^UniwZ0q&CpL,n7]NI^WK7ke{t).=LFHXyI'Z$Dn!g ^ _,Hq<3X*f=>fm8=qYyh!WQUMo_,GLDPPy*N^.(G0!$; O9WcsSY
CodePudding user response:
we should look for the last line with complete data
Write a program that opens the file, seeks to the end, reads lines from the back of the file (by searching backwards for newlines), checks whether each line is "complete", and, as soon as it finds a complete line, outputs it and terminates.
sed can't read the file from the end. Gluing commands together in a pipeline means the command on the left will over-read the file and push data into the pipe, causing a lot of unnecessary I/O.
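Short of writing such a program, a rough shell approximation (assuming GNU coreutils) is to let tac read the file from the end and stop at the first sufficiently long line:
# Assumes GNU tac/grep: emit the file last-line-first, stop at the first 100+ char line.
lastline=$(tac "logs1/$file" | grep -m 1 '.\{100\}')
On a regular (seekable) file, GNU tac reads from the end in blocks, and grep -m 1 exits as soon as it prints its first match; the broken pipe then terminates tac, so in practice only the tail of the file is read. It is still a pipeline, so some over-reading remains, but far less than scanning the whole file from the start.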