Regex behaviour is different on terminal from online validators

Time:11-12

I wrote a regex to extract values from a templated string. It works fine on sites like regexr.com, but it fails when I run it in a shell.

As an example, take these two lines:

[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default

[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default

Both lines follow this pattern:

[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%" "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" %UPSTREAM_CLUSTER% %UPSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_LOCAL_ADDRESS% %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME%\n

Based on that, I created this regex to extract the value of the UPSTREAM_CLUSTER field, i.e. values like outbound|4466||laundry.transaction.svc.cluster.local:

(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) 

I tested this regex on regexr.com and it shows the right values as group 14 for both lines:

outbound|4466||laundry.transaction.svc.cluster.local
outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 

After that, I tried awk -v FPAT, but the groups look wrong. To get the UPSTREAM_CLUSTER value I have to change the print position for each line, which is not viable because I'm building an automation to process these logs:

echo '[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default' | awk -v FPAT='(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) ' -v OFS='|' '{print $15}'

# in the example above I'm using '{print $15}'

echo '[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default' | awk -v FPAT='(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) ' -v OFS='|' '{print $18}'
 
# in the example above I'm using '{print $18}'

Is there any way to make it work for both log lines with the same print position?

CodePudding user response:

You don't need such a complex regex to parse the log entries shown in the question.

Consider this awk solution with a simpler regex (FPAT requires GNU awk):

awk -v FPAT='\\[[^]]*]|"[^"]*"|\\S+' '
{
   # print record number, field number and field value
   for (i=1; i<=NF; ++i) print NR ":" i "::" $i
}' file.log

1:1::[2022-11-11T12:07:00.789Z]
1:2::"GET /check?subject=johnbegucci HTTP/1.1"
1:3::200
1:4::-
1:5::"-"
1:6::0
1:7::17
1:8::3
1:9::2
1:10::"-"
1:11::"-"
1:12::"4e4c4fb1-a4d8-4075-8e42-b5fb9216f863"
1:13::"laundry.transaction.svc.cluster.local:4466"
1:14::"172.16.107.246:4466"
1:15::outbound|4466||laundry.transaction.svc.cluster.local
1:16::172.16.67.246:51630
1:17::10.100.111.246:4466
1:18::172.16.67.246:48610
1:19::-
1:20::default
2:1::[2022-11-11T13:31:41.189Z]
2:2::"GET /v1/campaign/198237-jsd-1231 HTTP/1.1"
2:3::200
2:4::-
2:5::"-"
2:6::0
2:7::674
2:8::63
2:9::63
2:10::"-"
2:11::"Apache-HttpClient/4.5.10 (Java/11.0.7)"
2:12::"9b3afd5b-c092-4e84-9f29-6380b7f2cafc"
2:13::"mkt-extractor.mkt-extractor"
2:14::"172.16.108.138:80"
2:15::outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local
2:16::172.16.65.24:57134
2:17::10.100.19.249:80
2:18::172.16.65.24:38816
2:19::-
2:20::default

I printed the output this way to show every field of every record. The outbound|... value is field 15 in both records, so print $15 works for both lines.
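
If GNU awk is not available, FPAT is just an ordinary variable and has no effect on field splitting. Below is a rough, portable sketch of the same tokenizing idea using only POSIX awk's match(). The helper name nth_field is made up for this example, and the timestamp is picked up by the non-blank alternative, which assumes it never contains a space:

```shell
#!/bin/sh
# nth_field N: print the Nth token of every input line (hypothetical helper).
# A token is either a double-quoted string or a run of non-blank characters.
nth_field() {
  awk -v n="$1" '
  {
    rest = $0
    count = 0
    # Peel tokens off the front of the line until the Nth one is reached.
    while (match(rest, /"[^"]*"|[^ \t]+/)) {
      count++
      if (count == n) {
        print substr(rest, RSTART, RLENGTH)
        break
      }
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

# Both sample lines from the question; token 15 is the outbound|... value.
nth_field 15 <<'EOF'
[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default
[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default
EOF
```

match() returns the leftmost-longest match, so at an opening quote the whole quoted string wins over the shorter run of non-blank characters. That keeps a user agent with spaces in one token.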

CodePudding user response:

Assuming you are happy with that regex, you can use Perl to execute it:

s1='[2022-11-11T12:07:00.789Z] "GET /check?subject=johnbegucci HTTP/1.1" 200 - "-" 0 17 3 2 "-" "-" "4e4c4fb1-a4d8-4075-8e42-b5fb9216f863" "laundry.transaction.svc.cluster.local:4466" "172.16.107.246:4466" outbound|4466||laundry.transaction.svc.cluster.local 172.16.67.246:51630 10.100.111.246:4466 172.16.67.246:48610 - default'

s2='[2022-11-11T13:31:41.189Z] "GET /v1/campaign/198237-jsd-1231 HTTP/1.1" 200 - "-" 0 674 63 63 "-" "Apache-HttpClient/4.5.10 (Java/11.0.7)" "9b3afd5b-c092-4e84-9f29-6380b7f2cafc" "mkt-extractor.mkt-extractor" "172.16.108.138:80" outbound|80||mkt-extractor.mkt-extractor.svc.cluster.local 172.16.65.24:57134 10.100.19.249:80 172.16.65.24:38816 - default'


echo "$s1" | perl -lnE 'say $14 if /(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) /'
"172.16.107.246:4466"

echo "$s2" | perl -lnE 'say $14 if /(\[.*\])\s(\".*\")\s([0-9]*)\s(.*)\s(\".*\")\s([0-9]*)\s([0-9]*)\s([0-9]*)\s([0-9]*)\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(\".*\")\s(.*)\s(.*)\s(.*)\s(.*)\s(.*)\s(.*) /'
"172.16.108.138:80"