I wrote a regex to extract values from a templated string. The regex works fine on sites like regexr.com, but it fails when I run it in the shell.
As an example, let's use these lines:
[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default
[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default
Both lines follow this pattern:
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME%\n
Based on that, I created this regex to extract the values of UPSTREAM_HOST (values like outbound|4466||laundry.transaction.svc.cluster.local):
(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)
I tested this regex on regexr.com, and it shows the right values as group 14 for both lines:
outbound|4466||laundry.transaction.svc.cluster.local
outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local
After that, I tried running awk with -v FPAT, but the groups look wrong. To get the values of UPSTREAM_HOST, I have to change the printed field number for each line, which is not viable because I'm building an automation to process the logs:
echo '[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default' | awk -v FPAT='(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) ' -v OFS='|' '{print $15}'
# the example above uses '{print $15}'
echo '[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default' | awk -v FPAT='(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) ' -v OFS='|' '{print $18}'
# the example above uses '{print $18}'
Is there any way to make it work for both logs with the same print position?
CodePudding user response:
You don't need such a complex regex to parse the log entries shown in the question.
Consider this awk solution (GNU awk, since FPAT is a gawk extension) with a simpler regex:
awk -v FPAT='\\[[^]]*]|"[^"]*"|\\S+' '
{
   # print record number, field number, and field value
   for (i=1; i<=NF; i++) print NR ":" i "::" $i
}' file.log
1:1::[2022-11-11T12:07:00.789Z]
1:2::"GET /check?subject=johnbegucci HTTP/1.1"
1:3::200
1:4::-
1:5::"-"
1:6::0
1:7::17
1:8::3
1:9::2
1:10::"-"
1:11::"-"
1:12::"4e4c4fb1-a4d8-4075-8e42-b5fb9216f863"
1:13::"laundry.transaction.svc.cluster.local:4466"
1:14::"172.16.107.246:4466"
1:15::outbound|4466||laundry.transaction.svc.cluster.local
1:16::172.16.67.246:51630
1:17::10.100.111.246:4466
1:18::172.16.67.246:48610
1:19::-
1:20::default
2:1::[2022-11-11T13:31:41.189Z]
2:2::"GET /v1/campaign/198237-jsd-1231 HTTP/1.1"
2:3::200
2:4::-
2:5::"-"
2:6::0
2:7::674
2:8::63
2:9::63
2:10::"-"
2:11::"Apache-HttpClient/4.5.10 (Java/11.0.7)"
2:12::"9b3afd5b-c092-4e84-9f29-6380b7f2cafc"
2:13::"mkt-extractor.mkt-extractor"
2:14::"172.16.108.138:80"
2:15::outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local
2:16::172.16.65.24:57134
2:17::10.100.19.249:80
2:18::172.16.65.24:38816
2:19::-
2:20::default
I printed it this way to show every field of every record. The outbound|... value is field 15 in both records, so print $15 works for both lines.
CodePudding user response:
Assuming you are happy with that regex, you can use Perl to execute it:
s1='[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default'
s2='[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default'
echo "$s1" | perl -lnE 'say $14 if /(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) /'
"172.16.107.246:4466"
echo "$s2" | perl -lnE 'say $14 if /(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) /'
"172.16.108.138:80"