I have a log that contains lines from processes that handle authentications. Such an authentication usually consists of 4 lines. For example:
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:45 [3711]: connection opened
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:45 [9957]: connection opened
2022-09-21T00:02:46 [3711]: authentication attempt
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:02:49 [3711]: user FOO authentication failure
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:02:51 [9957]: authentication attempt
2022-09-21T00:03:01 [9957]: user QOZ authentication failure
2022-09-21T00:03:02 [3711]: connection closed
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:04:11 [9957]: connection closed
I would like to find all the logs that deal with a successful authentication. I can do this by finding the PIDs of those, and then grepping the file again for all lines that contain that PID:
for pid in $(cat log.txt | sed -En 's/.*\[([0-9]+)\].*success/\1/p'); do
    grep "$pid" log.txt
done
This works:
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: user BAZ authentication success
2022-09-21T00:03:45 [18633]: connection closed
But it is very inefficient, as there are many events and I'm grepping the entire file once for each of them. On a multi-GB file this is not feasible.
I'm looking for an alternative way to do this. Is there a way (perhaps in awk?) to do something like this?
- Match lines with "authentication success" and extract the PID
- Search in the neighbourhood (say 10 lines before and after the match) for lines with that PID, and return all of them
many thx
CodePudding user response:
See if this works:
awk -F'[][]' '/success/{a[$2]; for(k in p) if(p[k]==$2) print h[k]}
              $2 in a{print}
              {i=NR%13; p[i]=$2; h[i]=$0}' log.txt
- `-F'[][]'` uses `[` and `]` as field separators, so that `$2` gives the PID
- `{i=NR%13; p[i]=$2; h[i]=$0}` saves the PID in array `p` and the corresponding line in array `h` (the number `13` determines how many previous lines are saved)
- `/success/{a[$2]; for(k in p) if(p[k]==$2) print h[k]}` when a new success PID is found, saves that PID in array `a` and prints the matching lines from the data previously saved in arrays `p` and `h`
- `$2 in a{print}` prints lines having a successful PID from the current line onwards
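One caveat with a `for (k in p)` loop is that awk does not specify the order in which array indices are visited, so the buffered lines may come out in a different order than they appear in the log. The sketch below is one possible variation that replays the buffer from oldest to newest by walking the line numbers instead. The function name `success_sessions` and the configurable window size `n` are assumptions made for illustration, not part of the answer above.

```shell
#!/bin/sh
# Sketch: print all lines of sessions that contain a "success" line,
# replaying up to n earlier lines in their original order.
# Usage: success_sessions LOGFILE [WINDOW]   (both names are illustrative)
success_sessions() {
    awk -F'[][]' -v n="${2:-13}" '
        /success/ && !($2 in a) {
            a[$2]                                # remember the successful PID
            # The slots hold lines NR-n .. NR-1; visit them oldest first
            for (j = NR - n; j < NR; j++)
                if (j > 0 && p[j % n] == $2) print h[j % n]
        }
        $2 in a { print }                        # success line and later lines of that PID
        { i = NR % n; p[i] = $2; h[i] = $0 }     # keep only the last n lines
    ' "$1"
}
```

Calling `success_sessions log.txt 20` would look back at most 20 lines from each success line. Lines that lie further back than the window are not recovered, which is the trade-off for reading the file only once.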
CodePudding user response:
I suggest taking a look at the Context Line Control options of grep, more precisely at using the --context=NUM option for preprocessing. This should allow you to extract the interesting subset of your text file, giving you a smaller file that is easier to process.
Consider the following simple example. Let the content of file.txt be
Able
Baker
Charlie
Charlie
Dog
Easy
Fox
then
grep --context=1 'Charlie' file.txt
gives output
Baker
Charlie
Charlie
Dog
As you can see, one line before and one line after the matches is included.
(tested in GNU grep 3.1)
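Applied to the log from the question, the preprocessing idea might look like the sketch below. It first lets grep keep each success line together with its surrounding context, then reads that smaller subset twice with awk: once to collect the PIDs that have a success line, and once to print the lines carrying those PIDs. The function name `filter_success`, the default context of 10 lines and the temporary file are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: shrink the log with grep context, then filter the subset by PID.
# Usage: filter_success LOGFILE [CONTEXT]   (both names are illustrative)
filter_success() {
    log=$1
    ctx=${2:-10}
    subset=$(mktemp) || return 1

    # Keep every success line plus ctx lines before and after it.
    grep --context="$ctx" 'authentication success' "$log" > "$subset"

    # Pass 1 (NR == FNR): remember the PIDs that have a success line.
    # Pass 2: print each subset line whose PID was remembered.
    awk -F'[][]' '
        NR == FNR { if (/authentication success/) ok[$2]; next }
        $2 in ok
    ' "$subset" "$subset"

    rm -f "$subset"
}
```

GNU grep prints a `--` line between non-adjacent groups of context. Those lines contain no PID, so the second awk pass drops them. The output stays in log order rather than being grouped by PID, and lines outside the chosen context are not included.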
CodePudding user response:
Consider something like this instead:
$ sort -k2,2 -k1,1 file |
awk '
$2 != prev { if (rec ~ /success/) print rec; rec=""; prev=$2 }
{ rec = rec $0 ORS }
END { if (rec ~ /success/) print rec }
'
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
or if you wanted the output ordered by the success line's timestamp rather than the PID you could do:
sort -k2,2 -k1,1 file |
awk -v OFS='\t' '
$2 != prev { prt(); numLines=0; time=""; prev=$2 }
/success/ { time = $1 }
{ rec[++numLines] = $0 }
END { prt() }
function prt( i) {
if (time) {
for (i=1; i<=numLines; i++) {
print time, prev, rec[i]
}
}
}
' |
sort -k1,1 -k2,2 |
cut -f3-
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success