Search log for regex pattern, then search surrounding lines again for previous match

Time:10-09

I have a log that contains lines from processes that handle authentications. Such an authentication usually consists of 4 lines. For example:

2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:45 [3711]: connection opened
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:45 [9957]: connection opened
2022-09-21T00:02:46 [3711]: authentication attempt
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:02:49 [3711]: user FOO authentication failure
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:02:51 [9957]: authentication attempt
2022-09-21T00:03:01 [9957]: user QOZ authentication failure
2022-09-21T00:03:02 [3711]: connection closed
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:04:11 [9957]: connection closed

I would like to find all the log lines that deal with a successful authentication. I can do this by finding the PIDs of those authentications, and then grepping the file again for all lines that contain each PID:

for pid in $(cat log.txt | sed -En 's/.*\[([0-9]+)\].*success/\1/p'); do
  grep $pid log.txt
done

This works:

2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: user BAZ authentication success
2022-09-21T00:03:45 [18633]: connection closed

But it is very inefficient: there are many events, and I'm grepping the entire file once for each of them. On a multi-GB file this is not feasible.

I'm looking for an alternative way to do this. Is there a way (perhaps in awk?) to do something like this?

  1. Match lines with authentication success and extract the PID
  2. Search in the neighbourhood (say 10 lines before and after the match) for lines with that PID, and return all of them

Many thanks.

CodePudding user response:

See if this works:

awk -F'[][]' '/success/{a[$2]; for(k in p) if(p[k]==$2) print h[k]}
              $2 in a{print}
              {i=NR%13; p[i]=$2; h[i]=$0}' log.txt
  • -F'[][]' uses [ and ] as field separators, so that $2 gives the PID
  • {i=NR%13; p[i]=$2; h[i]=$0} saves the PID in array p and the corresponding line in h, using NR%13 as the index so that slots are reused (the number 13 determines how many previous lines are saved)
  • /success/{a[$2]; for(k in p) if(p[k]==$2) print h[k]} when a new success PID is found, saves that PID in array a and prints the matching lines from the previously saved data in arrays p and h
  • $2 in a{print} prints lines having a successful PID from the current line onwards
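
Because for(k in p) visits array elements in an unspecified order, the buffered lines may not come out chronologically. A minimal sketch of an order-preserving variant follows; the function name success_window, the array names and the default window of 13 are assumptions made for illustration, not part of the answer above. It replays the buffer from the oldest slot to the newest and only does so on the first success seen for a PID.

```shell
# Hypothetical wrapper around the ring-buffer idea; the function name and the
# default window of 13 lines are assumptions, not part of the original answer.
success_window() {
    awk -F'[][]' -v n="${2:-13}" '
        # First success for this PID: replay its buffered lines, oldest first.
        /success/ && !($2 in ok) {
            ok[$2]
            for (j = (NR > n ? NR - n : 1); j < NR; j++) {
                if (pid[j % n] == $2) print line[j % n]
            }
        }
        { k = NR % n }
        # Lines of a known-successful PID are printed at once and not kept.
        $2 in ok { print; pid[k] = ""; next }
        # Everything else goes into the ring buffer of the last n lines.
        { pid[k] = $2; line[k] = $0 }
    ' "$1"
}
```

Usage would be something like success_window log.txt 20. As in the original one-liner, once a PID has succeeded all of its later lines are printed, not only those within the window.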

CodePudding user response:

I suggest taking a look at the Context Line Control options of grep, more precisely the --context=NUM option, as a preprocessing step. It should allow you to extract the interesting subset of your text file, giving you a smaller file that is easier to process.

Consider the following simple example. Let the content of file.txt be

Able
Baker
Charlie
Charlie
Dog
Easy
Fox

then

grep --context=1 'Charlie' file.txt

gives output

Baker
Charlie
Charlie
Dog

As you can see, 1 line before and after each match is included.

(tested in GNU grep 3.1)
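
Applied to the log from the question, the preprocessing step might look like the sketch below. The function name success_context and the default of 10 context lines are assumptions chosen for illustration. The short form -C NUM is used in place of --context=NUM, and the second grep drops the "--" separator lines that grep prints between non-adjacent groups of context.

```shell
# Hypothetical helper: keep only the lines within n lines of a success line.
# The function name and the default of 10 are assumptions for illustration.
success_context() {
    file=$1
    n=${2:-10}
    grep -C "$n" 'authentication success' "$file" | grep -vx -e '--'
}
```

The reduced output, for example success_context log.txt 10 > subset.txt, could then be searched per PID as in the question, at a fraction of the cost of scanning the full file each time.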

CodePudding user response:

Consider something like this instead:

$ sort -k2,2 -k1,1 file |
    awk '
        $2 != prev { if (rec ~ /success/) print rec; rec=""; prev=$2 }
        { rec = rec $0 ORS }
        END { if (rec ~ /success/) print rec }
    '
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success

2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed

or if you wanted the output ordered by the success line's timestamp rather than the PID you could do:

sort -k2,2 -k1,1 file |
    awk -v OFS='\t' '
        $2 != prev { prt(); numLines=0; time=""; prev=$2 }
        /success/ { time = $1 }
        { rec[++numLines] = $0 }
        END { prt() }
        function prt(    i) {
            if (time) {
                for (i=1; i<=numLines; i++) {
                    print time, prev, rec[i]
                }
            }
        }
    ' |
    sort -k1,1 -k2,2 |
    cut -f3-
2022-09-21T00:02:45 [61811]: connection opened
2022-09-21T00:02:46 [61811]: authentication attempt
2022-09-21T00:03:34 [61811]: user BAR authentication success
2022-09-21T00:03:38 [61811]: connection closed
2022-09-21T00:02:45 [18633]: connection opened
2022-09-21T00:02:51 [18633]: authentication attempt
2022-09-21T00:03:45 [18633]: connection closed
2022-09-21T00:03:45 [18633]: user BAZ authentication success
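
If sorting a multi-GB file is itself a concern, a sort-free two-pass awk is one possible alternative. This is a different technique from the sort-based pipelines above, sketched under the assumption that the input is a regular file that can be read twice; the function name success_sessions is made up for illustration. The first pass collects the PIDs that logged a success, and the second pass prints every line belonging to one of those PIDs, in the original file order.

```shell
# Hypothetical helper: print every line whose PID logged a success anywhere
# in the file. Assumes "$1" is a regular file, since it is read twice.
success_sessions() {
    awk -F'[][]' '
        # Pass 1 (NR == FNR only while reading the first copy): collect PIDs.
        NR == FNR { if (/authentication success/) ok[$2]; next }
        # Pass 2: print the lines of those PIDs.
        $2 in ok
    ' "$1" "$1"
}
```

Like the other whole-file approaches, this does not limit the match to a neighbourhood, so a PID that is reused later in the log would have all of its lines printed.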