Parsing an nginx access.log. It is delimited by 1) whitespace, 2) [ ], and 3) double quotes.
::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
After parsing, it is supposed to look like this:
$1 = ::1
$4 = [12/Oct/2021:15:26:25 +0530] or 12/Oct/2021:15:26:25 +0530
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
I tried some options like awk -F'[],] *' and awk -f [][{}], but they don't work on the full line.
The nginx access.log shared here is just an example. I am trying to understand how to parse a mix of such delimiters, for use with other complex logs.
CodePudding user response:
I am trying to understand how to parse a mix of such delimiters, for use with other complex logs.
Setting just the field separator (FS) in GNU AWK will not suffice for this case. Take a look at:
::1 - - [12/Oct/2021:15:26:25 +0530] "GET / HTTP/1.1" 200 1717 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
::1 - - [12/Oct/2021:15:26:25 +0530] "GET /css/custom.css HTTP/1.1" 200 202664 "https://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
- spaces inside [ and ] are not separators
- spaces inside " are not separators
- all other spaces are separators
As far as I know, it is impossible to craft a pattern for the field separator that matches only those spaces which really are separators.
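One possible workaround, sketched here rather than offered as the definitive approach, is to stop looking for a single FS and split in two stages: first on the double quote, then split the unquoted pieces on brackets and whitespace with split(). The sketch uses only POSIX awk features. The function name parse_access_line and the output labels are hypothetical, and it assumes every line has the same layout as the two samples above and that no quoted field contains a literal double quote.

```shell
# parse_access_line: hypothetical helper, reads access-log lines on stdin.
parse_access_line() {
  awk -F'"' '{
    # Stage 1: FS is the double quote, so
    #   $1 = host ident user [time]   $2 = request   $3 = status and bytes
    #   $4 = referer                  $6 = user agent
    # Stage 2: split the unquoted pieces further.
    split($1, head, /\[|\]/)   # head[1] = "host ident user ", head[2] = timestamp
    split(head[1], who, " ")   # who[1] = host, who[2] = ident, who[3] = user
    split($3, nums, " ")       # nums[1] = status, nums[2] = bytes sent
    print "host="    who[1]
    print "time="    head[2]
    print "request=" $2
    print "status="  nums[1]
    print "bytes="   nums[2]
    print "referer=" $4
    print "agent="   $6
  }'
}

# Usage (assumes a file named access.log in the current directory):
#   parse_access_line < access.log
```

The trade-off is that the field positions are hard-coded, so a log with a different number of quoted sections needs its own mapping.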
CodePudding user response:
1) Handled "square bracket" and "double quotes" -> simplified
awk -F'[][]' '{ print "remote_addr "$1 "local_time "$2 $3 }' access.log | awk -F'\"' '{ print $1 " method-&-path "$2 " respStatus-&-byteSent " $3 " http_referer " $4 " http_agent " $6 }'
2) Parsed all fields, displayed as Field and Value
awk -F'[][]' '{ print "remote_addr "$1 "local_time "$2 $3 }' access.log | awk -F'\"' '{ print $1 " method-&-path "$2 " respStatus-&-byteSent " $3 " http_referer " $4 " http_agent " $6 } ' | awk '{print "remote_addr "$2", local_time "$6 ", method "$9", path "$10", resp_status "$13", bytes_sent "$14", http_referer "$16", http_agent "$18" "$19" "$20" "$21" "$22" "$23" "$24" "$25" "$26" "$27" "$28" "$29 }'
3) Parsed to json
awk -F'[][]' '{ print "remote_addr "$1 "local_time "$2 $3 }' access.log | awk -F'\"' '{ print $1 " method-&-path "$2 " respStatus-&-byteSent " $3 " http_referer " $4 " http_agent " $6 } ' | awk '{print " { \"remote_addr\" : \""$2"\" , \"local_time\" : \""$6 "\" , \"method\" : \""$9"\" , \"path\" : \""$10"\" , \"resp_status\" : \""$13"\" , \"bytes_sent\" : \""$14"\" , \"http_referer\" : \""$16"\" , \"http_agent\" : \""$18" "$19" "$20" "$21" "$22" "$23" "$24" "$25" "$26" "$27" "$28" "$29"\"}"}'
CodePudding user response:
GNU awk
gawk '
match($0, /([^[:blank:]]+) ([^[:blank:]]+) ([^[:blank:]]+) \[([^]]+)\] "([^"]+)" ([[:digit:]]+) ([[:digit:]]+) "([^"]+)" "([^"]+)"/, m) {
    for (i=1; i<=9; i++) print i, m[i]
}
' file
Or perl for more concise regexes
perl -nsE '
if (/(\S+) (\S+) (\S+) \[(.+?)\] "(.+?)" (\d+) (\d+) "(.+?)" "(.+?)"/) {
    for $i (1..9) { say $i, $$i }
}
' -- -,=" " file
Or with named captures, which would make it simpler to work with (but I'm reinventing the modules mentioned by @Shawn):
perl -MData::Dump=dd -nE '
dd \%+ if (/
    (?<host>\S+) \s
    (?<ident>\S+) \s
    (?<user>\S+) \s
    \[(?<timestamp>.+?)\] \s
    "(?<request>.+?)" \s
    (?<status>\d+) \s
    (?<size>\d+) \s
    "(?<referer>.+?)" \s
    "(?<user_agent>.+?)"
/x)
' file
CodePudding user response:
If you can use GNU awk, you can make use of FPAT to specify the column data:
awk -v FPAT='\\[[^][]*]|"([^"]*)"|\\S+' '{
  for(i=1; i<=NF; i++) {
    print "$"i" = ", $i
  }
}' file
The pattern matches:
- \\[[^][]*] Match from an opening [ till the closing ] using a negated character class
- | Or
- "([^"]*)" Match from an opening " till the closing "
- | Or
- \\S+ 1 or more non-whitespace chars
Output
$1 = ::1
$2 = -
$3 = -
$4 = [12/Oct/2021:15:26:25 +0530]
$5 = "GET / HTTP/1.1"
$6 = 200
$7 = 1717
$8 = "-"
$9 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
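If the surrounding [ ] and " " are not wanted in the values (the question allows either form for $4), the same FPAT can be combined with sub() on a copy of each field. This is only a sketch under the same gawk requirement: the function name strip_fields and the tab-separated output are assumptions for illustration, not part of the original command.

```shell
# strip_fields: hypothetical helper, reads log lines on stdin (needs GNU awk for FPAT).
strip_fields() {
  gawk -v FPAT='\\[[^][]*]|"[^"]*"|\\S+' '{
    for (i = 1; i <= NF; i++) {
      v = $i                    # work on a copy so the record is not rebuilt
      sub(/^(\[|")/, "", v)     # drop one leading [ or "
      sub(/(\]|")$/, "", v)     # drop one trailing ] or "
      print i "\t" v            # field number, a tab, then the bare value
    }
  }'
}

# Usage (assumes the log is in a file named file, as in the command above):
#   strip_fields < file
```

Working on a copy rather than assigning back to $i keeps $0 untouched, which matters if the original line is still needed later in the same rule.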