How can I find 2 patterns in a log file on different lines but only if they are a matching pair (sta-CodePudding

To start, I have done a lot of googling and tried grep, sed and awk, but was not able to find a solution with any of them.

My goal is to find 2 patterns in a log file, but only if each first pattern has a matching second pattern (with no other first pattern between them). After that I will compare the timestamps on both to calculate the time between them, but that part is not where I am stuck.

Almost every solution I found via Google either left me with all of the lines between the start and end (which I do not need) or left me with the first start match and the first end match (with multiple starts in between).

Example of the kind of text I am working with:

2022-09-10 20:17:05.552 [INFO] Starting process
2022-09-10 20:17:05.554 [INFO] junk here
2022-09-10 20:24:02.664 [INFO] junk here
2022-09-10 20:24:02.666 [INFO] Starting process
2022-09-10 20:30:57.526 [INFO] Starting process
2022-09-10 20:30:57.529 [INFO] Ending process
2022-09-10 20:37:55.122 [INFO] Starting process
2022-09-10 20:37:55.126 [INFO] Ending process
2022-09-10 20:44:50.352 [INFO] junk here

I want to find the lines with "Starting process" and then "Ending process" but with no "Starting process" between them (multiple starts without an end are failed attempts, and I only need the ones that completed). The example has multiple failed starts but only 2 starts that completed: lines 5-6 and 7-8.

Expected output:

2022-09-10 20:30:57.526 [INFO] Starting process
2022-09-10 20:30:57.529 [INFO] Ending process
2022-09-10 20:37:55.122 [INFO] Starting process
2022-09-10 20:37:55.126 [INFO] Ending process

Actually, the only output I really need would be:

2022-09-10 20:30:57.526
2022-09-10 20:30:57.529
2022-09-10 20:37:55.122
2022-09-10 20:37:55.126

(because my only need for these lines is to get the start and end time to calculate average time for this task when it completes)

I am willing to use most command-line methods available via bash on Ubuntu (this is for a Windows machine with WSL), so sed/awk/grep and possibly even Perl are fine.

CodePudding user response：

Here's a solution with awk for getting the matching dates:

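awk -F ' \\[[^[]*] ' '
    # remember the most recent start; a later start overwrites a failed one
    $2 == "Starting process" { d = $1 }
    # an end only counts if a start is pending
    $2 == "Ending process" && d != "" { print d, $1 ; d = "" }
' file.txt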

2022-09-10 20:30:57.526 2022-09-10 20:30:57.529
2022-09-10 20:37:55.122 2022-09-10 20:37:55.126

If you're using GNU awk then you can even calculate the time difference:

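awk -F ' \\[[^[]*] ' '
    # convert "YYYY-MM-DD HH:MM:SS.mmm" to epoch seconds, keeping the fraction
    # (mktime is a GNU awk extension)
    function date2time(d, _d) {
        _d = d
        gsub( /[T:-]/, " ", _d )
        return mktime(_d) substr(d, index(d,"."))
    }
    # remember the most recent start; a later start overwrites a failed one
    $2 == "Starting process" {
        t = date2time($1)
    }
    # print the elapsed seconds only if a start is pending
    $2 == "Ending process" && t != "" {
        printf "%.03f\n", date2time($1) - t
        t = ""
    }
' file.txt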

0.003
0.004
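
Since the goal stated in the question is an average, here is a rough sketch that sums the durations instead of printing each one. It is not the code from the answer above: it sticks to portable awk (no mktime), splits on plain whitespace so that $2 is the time of day, and assumes that every run is shorter than 24 hours. The name average_duration is made up for illustration.

```shell
# average_duration is a hypothetical helper name, not part of the answer above.
average_duration() {
  awk '
    # Seconds since midnight for a time of day such as 20:30:57.526
    function tod(t,   p) {
      split(t, p, ":")
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    # Remember the latest start; a newer start replaces a failed one.
    /\] Starting process/ { start = tod($2); pending = 1 }
    # Count an end only when a start is pending.
    /\] Ending process/ && pending {
      d = tod($2) - start
      if (d < 0) d += 86400   # assumption: the run crossed midnight once
      sum += d; runs++; pending = 0
    }
    END {
      if (runs) printf "%.1f ms average over %d runs\n", 1000 * sum / runs, runs
    }
  ' "$1"
}
```

Called as average_duration file.txt on the sample log, this should report an average of 3.5 ms over 2 runs. For runs that can last longer than a day, the mktime version above is the safer base.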

CodePudding user response：

You can try a routine like this:

#!/bin/bash
######################################################################
# Finds the matching pair of two patterns
# Arguments:
#  l_start_pattern - The first pattern
#  l_end_pattern - The second pattern
#  l_filename - File to scan line by line
######################################################################
function find_matching_pair() {
  local l_start_pattern="$1"
  local l_end_pattern="$2"
  local l_filename="$3"

  local l_record

  local l_start_pattern_found=false
  local l_start_record=""

  if [[ -z "${l_filename}" ]]; then
    return
  fi

  while IFS= read -r l_record ; do

    if [[ ${l_record} =~ ^.*${l_start_pattern} ]]; then
      l_start_pattern_found=true
      l_start_record="${l_record}"
    fi

    if [[ ${l_record} =~ ^.*${l_end_pattern} ]] &&
       [[ ${l_start_pattern_found} == true ]]; then
      echo "${l_start_record}"
      echo "${l_record}"
      l_start_pattern_found=false
      l_start_record=""
    fi

  done < "${l_filename}"
}

#
# Call the routine and cut the output at the '[' delimiter,
# since only the timestamp is needed
#
find_matching_pair "Starting process" "Ending process" "file.txt" | \
  cut -d'[' -f 1

CodePudding user response：

This might work for you (GNU sed):

sed -nE '/ \[.*Starting.*/,/ \[.*Ending.*/{s///p}' file

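This matches the range from a Starting line to an Ending line and prints only its first and last lines, with everything from the log level onward removed. The empty pattern in s/// reuses the most recently used regex, so the substitution only succeeds on those two boundary lines.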

Or alternatively:

sed -nE '/Starting/{:a;N;/Ending/!ba;s/ \[.*\n(.*) \[.*/\n\1/p}' file
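
As far as I can tell, both commands above begin collecting at the first Starting line they see, so with the sample log the failed start at 20:17:05.552 would be paired with the end at 20:30:57.529. One way around that is to keep only the most recent start in the hold space. The following is a sketch under that idea, assuming GNU sed (for \n inside a bracket expression); completed_pairs is a made-up name.

```shell
# completed_pairs is a hypothetical helper name; GNU sed is assumed.
completed_pairs() {
  sed -n '
    # Save each start in the hold space; a newer start replaces a failed one.
    /Starting process/{h;d}
    /Ending process/{
      # Swap: pattern space = saved line, hold space = this end line.
      x
      # Skip this end if the saved line is not a start (none is pending).
      /Starting process/!d
      # Join start and end, strip " [LEVEL] message" from both, then print.
      G
      s/ \[[^\n]*//gp
    }
  ' "$1"
}
```

After a pair is printed the hold space contains the Ending line, which does not match the start pattern, so a stray second Ending is ignored without any explicit reset. Usage would be completed_pairs file, which prints one timestamp per line.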

CodePudding user response：

Using GNU sed

$ sed -En '/Starting process/t;{N;/Ending process/s/\[[^\n]*//gp}' input_file
2022-09-10 20:30:57.526 
2022-09-10 20:30:57.529 
2022-09-10 20:37:55.122 
2022-09-10 20:37:55.126