How to parse logs (nginx/apache access.log) with a mix of delimiters, i.e. square brackets, spaces and double quotes


Here is an nginx access.log. It is delimited by 1) whitespace, 2) [ ], and 3) double quotes.

::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"

After parsing, it is supposed to look like this:

$1 = ::1

$4 = [12/Oct/2021:15:26:25 +0530] or 12/Oct/2021:15:26:25 +0530

$5 = "GET / HTTP/1.1"

$6 = 200

$7 = 1717

$8 = "-"

$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"

I tried some options like awk -F'[],] *' and awk -F'[][{}]', but they don't work on the full line.

The nginx access.log shared here is just an example. I am trying to understand how to parse with a mix of such delimiters, for use with other complex logs.

CodePudding user response:

Since these are nginx logs, their format will be consistent (or, in current versions, there are settings by which you can keep it consistent). We can take advantage of this and concentrate on extracting only the needed parts, so I am using a regex here to pick out just the matching values and skip what isn't needed. This way we need not hardcode the field numbers; the regex does the trick.

This should work in GNU awk (note that \s is a gawk extension; use [[:space:]] in other awk versions).

awk '
{
  while(match($0,/^::[0-9]+|\[?[0-9]{1,2}\/[a-zA-Z]{3}\/[0-9]{4}(:[0-9]{2}){3}\s+\+[0-9]{4}\]?|"[^"]*"|\s[0-9]{3}\s|[0-9]+\s/)){
    val=substr($0,RSTART,RLENGTH)
    gsub(/^[[:space:]]+|[[:space:]]+$/,"",val)
    print val
    $0=substr($0,RSTART+RLENGTH)
  }
}' Input_file
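The same match-and-advance loop translates to other languages too. A minimal Python sketch of the mechanics (search, take the match, continue scanning after it); note the pattern below is a simplified illustration, not the answer's exact regex, and it keeps every token including the "-" fields:

```python
import re

# Alternation ordered so bracketed/quoted groups win over bare tokens.
token = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

line = ('::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" '
        '200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64)"')

pos, values = 0, []
while True:
    m = token.search(line, pos)   # like match($0, /.../) in awk
    if not m:
        break
    values.append(m.group(0))     # like substr($0, RSTART, RLENGTH)
    pos = m.end()                 # like $0 = substr($0, RSTART+RLENGTH)

for v in values:
    print(v)
```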

CodePudding user response:

If you can use gnu-awk you can make use of FPAT to specify the column data:

awk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '{
  for(i=1; i<=NF; i++) {
    print "$"i" = ", $i
  }
}' file

The pattern matches:

  • \\[[^][]*] Match from an opening [ till closing ] using a negated character class
  • | Or
  • "[^"]*" Match from an opening till closing double quote
  • | Or
  • \\S+ Match 1 or more non-whitespace chars

Output

$1 =  ::1
$2 =  -
$3 =  -
$4 =  [12/Oct/2021:15:26:25 +0530]
$5 =  "GET / HTTP/1.1"
$6 =  200
$7 =  1717
$8 =  "-"
$9 =  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
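The FPAT pattern carries over directly to a regex findall, and once the fields are isolated it is easy to also strip the surrounding brackets/quotes. A hedged Python sketch of that follow-up step (the bare helper is my addition, not part of the original answer):

```python
import re

line = ('::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" '
        '200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64)"')

# Same three alternatives as the FPAT value: [bracketed] | "quoted" | bare token.
fields = re.findall(r'\[[^][]*]|"[^"]*"|\S+', line)

def bare(f):
    """Drop one layer of surrounding [ ] or " " if present."""
    if len(f) >= 2 and (f[0], f[-1]) in {('[', ']'), ('"', '"')}:
        return f[1:-1]
    return f

for i, f in enumerate(fields, 1):
    print(f'${i} = {bare(f)}')
```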

CodePudding user response:

This might work for you (GNU sed):

sed -E 'y/ /\n/
       :a;s/^(\[[^]\n]*)\n/\1 /m;s/^("[^"\n]*)\n/\1 /m;ta
       s/.*/echo '\''&'\'' | cat -n/e
       s/^  *(\S)\t/$\1 = /mg' file

Replace all spaces by newlines.

Group all lines that begin and end in either [ and ] or double quotes and replace newlines by spaces.

Number all the lines.

Remove leading spaces and tabs and format the result.
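The split-then-rejoin idea behind the first two sed steps can also be sketched procedurally. A Python illustration of that re-joining logic, assuming no nested or escaped delimiters (the fields helper name is mine):

```python
def fields(line):
    """Split on single spaces, re-joining runs between [ ] or " " pairs."""
    out, buf, closer = [], [], None
    for tok in line.split(' '):
        if closer is None:
            if tok.startswith('[') and not tok.endswith(']'):
                buf, closer = [tok], ']'   # open bracket group
            elif tok.startswith('"') and (len(tok) == 1 or not tok.endswith('"')):
                buf, closer = [tok], '"'   # open quote group
            else:
                out.append(tok)            # ordinary field
        else:
            buf.append(tok)
            if tok.endswith(closer):       # group closed: emit joined field
                out.append(' '.join(buf))
                buf, closer = [], None
    return out

line = ('::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" '
        '200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64)"')
print(fields(line))
```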

CodePudding user response:

GNU awk

gawk '
    match($0, /([^[:blank:]]+) ([^[:blank:]]+) ([^[:blank:]]+) \[([^]]+)\] "([^"]+)" ([[:digit:]]+) ([[:digit:]]+) "([^"]+)" "([^"]+)"/, m) {
        for (i=1; i<=9; i++) print i, m[i]
    }
' file

Or perl for more concise regexes

perl -nsE '
    if (/(\S+) (\S+) (\S+) \[(.+?)\] "(.+?)" (\d+) (\d+) "(.+?)" "(.+?)"/) {
        for $i (1..9) { say $i, $$i }
    }
' -- -,=" " file

Or with named captures, which would make it simpler to work with (but I'm reinventing the modules mentioned by @Shawn):

perl -MData::Dump=dd -nE '
    dd \%+ if (/
        (?<host>\S+) \s
        (?<ident>\S+) \s
        (?<user>\S+) \s
        \[(?<timestamp>.+?)\] \s
        "(?<request>.+?)" \s
        (?<status>\d+) \s
        (?<size>\d+) \s
        "(?<referer>.+?)" \s
        "(?<user_agent>.+?)"
    /x)
' file
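The named-capture layout carries over to Python's (?P&lt;name&gt;...) syntax as well; a hedged sketch that yields a dict per line, with field names mirroring the perl answer:

```python
import re

# One named group per field, anchored left-to-right across the line.
LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>.+?)\] "(?P<request>.+?)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referer>.*?)" "(?P<user_agent>.*)"'
)

line = ('::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" '
        '200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64)"')

m = LOG.match(line)
if m:
    record = m.groupdict()   # {'host': '::1', 'status': '200', ...}
    print(record['request'], record['status'])
```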

CodePudding user response:

Parse access.log to JSON:

awk -F'[][]' '{ print "remote_addr "$1 "local_time "$2 $3 }' access.log | awk -F'\"' '{ print $1   " method-&-path "$2 "  respStatus-&-byteSent " $3 " http_referer " $4 " http_agent " $6 } ' | awk  '{print " { \"remote_addr\" : \""$2"\" , \"local_time\" : \""$6 "\" , \"method\" : \""$9"\" , \"path\" : \""$10"\" , \"resp_status\" : \""$13"\" , \"bytes_sent\" : \""$14"\" , \"http_referer\" : \""$16"\" , \"http_agent\" : \""$18"   "$19" "$20" "$21" "$22" "$23" "$24" "$25" "$26" "$27" "$28" "$29"\"}"}'

Output

{ "remote_addr" : "::1" , "local_time" : "12/Oct/2021:15:26:25" , "method" : "GET" , "path" : "/css/custom.css" , "resp_status" : "200" , "bytes_sent" : "202664" , "http_referer" : "https://localhost/" , "http_agent" : "Mozilla/5.0   (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36  "}

Parse access.log, displaying both field names and values:

awk -F'[][]' '{ print "remote_addr "$1 "local_time "$2 $3 }' access.log | awk -F'\"' '{ print $1   " method-&-path "$2 "  respStatus-&-byteSent " $3 " http_referer " $4 " http_agent " $6 } ' | awk -F' ' '{print "remote_addr"$2", local_time "$6 ", method "$9", path "$10", resp_status "$13", bytes_sent "$14", http_referer "$16", http_agent "$18"  "$19" "$20" "$21" "$22" "$23" "$24" "$25" "$26" "$27" "$28" "$29 }'

Output

remote_addr::1, local_time 12/Oct/2021:15:26:25, method GET, path /, resp_status 200, bytes_sent 1717, http_referer -, http_agent Mozilla/5.0  (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36  

CodePudding user response:

I am trying to understand how to parse with mix of such delimiters for usages in other complex logs.

Setting just the Field Separator in GNU AWK will not suffice for this case; take a look at

::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
  • spaces which are inside [ and ] are not separators
  • spaces which are inside " are not separators
  • all other spaces are separators

As far as I know, it is impossible to craft a Field Separator pattern that would correctly match only the spaces which act as separators.
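One way to see the limitation concretely: a plain whitespace split tears both the timestamp and the quoted strings apart. Python's str.split stands in for a whitespace-only field separator in this small demonstration:

```python
line = ('::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" '
        '200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64)"')

naive = line.split()       # what a whitespace-only field separator produces
print(len(naive))          # far more than the 9 logical fields
print(naive[3], naive[4])  # the timestamp has been split in two
```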
