Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; [email protected] [email protected] proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=bounces [email protected] [email protected] proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the log messages above. The third is my attempt at consolidating both into a single RegEx that parses data from either log message. My approach was to use a conditional that checks for the existence of http(s) and, if found, parses the URL into a named group. If http(s) is not found, it skips everything up to the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s), even though this token is set as optional (i.e. using the ? quantifier). If I remove the ? quantifier, it does find http(s) and parses the URL as desired, but then the RegEx no longer works with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.*) host1 postfix.*RCPT from (?P<srcDns>.*)\[(?P<srcIp>[0-9\.]*)\]:.*blocked using (?P<blkList>.*);.*https?:\/{2}(?P<entryUrl>.*);\s.*\sto=\<(?P<destEm>.*)>.*$
Parse entries without URL
^(?P<datetime>.*) host1 postfix.*RCPT from (?P<srcDns>.*)\[(?P<srcIp>[0-9\.]*)\]:.*blocked using (?P<blkList>.*);\s.*\sto=\<(?P<destEm>.*)>.*$
Attempt at consolidating RegEx
^(?P<datetime>.*) host1 postfix.*RCPT from (?P<srcDns>.*)\[(?P<srcIp>[0-9\.]*)\]:.*blocked using (?P<blkList>.*)(?<=[a-z]);.*(https?:\/{2})?(?(5)(?P<entryUrl>.*)|.*)to=\<(?P<destEm>.*)>.*$
I'm sure the issue is my misunderstanding of how conditional statements and the ? quantifier work. All suggestions are welcome, and thanks in advance for your time.
CodePudding user response:
Have you tried testing your regex on a site like regex101?
to=\<(?P<destEm>.*)> doesn't seem to match your examples. You should either remove the <> or replace to with helo. Also be careful to make the quantifier after blkList lazy, otherwise you might capture too much text.
You can then make the URL part optional with ? and it should work in both cases:
^(?P<datetime>.*) host1 postfix.*RCPT from (?P<srcDns>.*)\[(?P<srcIp>[0-9\.]*)\]:.*blocked using (?P<blkList>.*?);(.*https?:\/{2}(?P<entryUrl>.*);\s)?.*\sto=(?P<destEm>.*?)\s.*$
CodePudding user response:
One approach would be to replace, in the first regex, .*https?:\/{2}(?P<entryUrl>.*); with (?:.*https?:\/{2}(?P<entryUrl>.*);)? where ?: marks the group as non-capturing and the ? at the end makes it optional.
However, that change alone still does not work because .* is greedy, so use the lazy .*? instead.
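To illustrate the greedy versus lazy point, here is a minimal Python sketch. The shortened log fragment, the example.org/example.com names, and the trimmed-down patterns are all made up for illustration; they only mirror the structure of the real expressions.

```python
import re

# Hypothetical, shortened log fragment (not taken from the real logs).
text = "blocked using b.example.org; http://example.com/info; to=<user@example.org>"

# Greedy version: the optional URL group is present, but .* in blkList
# runs to the LAST ';' so nothing is left for the URL group to match.
greedy = re.search(
    r"blocked using (?P<blkList>.*);(?:.*https?://(?P<entryUrl>.*);)?\s.*to=<(?P<destEm>.*)>",
    text,
)

# Lazy version: .*? stops at the FIRST ';', so the optional group gets
# a chance to match the URL before the engine moves on.
lazy = re.search(
    r"blocked using (?P<blkList>.*?);(?:.*?https?://(?P<entryUrl>.*?);)?\s.*?to=<(?P<destEm>.*?)>",
    text,
)

print(greedy.group("blkList"), "|", greedy.group("entryUrl"))
print(lazy.group("blkList"), "|", lazy.group("entryUrl"))
```

In the greedy case the overall match still succeeds, which is why the problem is easy to miss: the URL simply ends up inside blkList and entryUrl stays empty.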
Final regex:
^(?P<datetime>.*?) host1 postfix.*?RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]*)\]:.*?blocked using (?P<blkList>.*?);(?:.*?https?:\/{2}(?P<entryUrl>.*?);)?\s.*?\sto=\<(?P<destEm>.*?)>.*?$
https://regex101.com/r/QkmXWz (to see it in action)
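If you want to check this outside regex101, the sketch below runs the same pattern with Python's re module, assuming the quantifiers are * as written above. The from=/to= addresses in the sample lines are placeholders I made up (sender@example.com, bounces@example.net, rcpt@example.org), since the real addresses are not visible in the question, and the parse helper is just a name chosen for this example.

```python
import re

# Consolidated pattern, split across lines only for readability.
PATTERN = re.compile(
    r"^(?P<datetime>.*?) host1 postfix.*?RCPT from (?P<srcDns>.*?)"
    r"\[(?P<srcIp>[0-9\.]*)\]:.*?blocked using (?P<blkList>.*?);"
    r"(?:.*?https?:\/{2}(?P<entryUrl>.*?);)?"
    r"\s.*?\sto=\<(?P<destEm>.*?)>.*?$"
)

# Sample lines based on the question; the from=/to= addresses are placeholders.
WITH_URL = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<sender@example.com> to=<rcpt@example.org> proto=ESMTP "
    "helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
WITHOUT_URL = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: "
    "RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service "
    "unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; "
    "from=<bounces@example.net> to=<rcpt@example.org> proto=ESMTP "
    "helo=<mail1.sendersrv.com>"
)


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


for sample in (WITH_URL, WITHOUT_URL):
    print(parse(sample))
```

Note that entryUrl holds the URL without its http:// prefix, because the scheme sits outside the named group, and that it is None for entries that have no URL.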