Trying to build a flexible regex for parsing badly structured syslogs with many optional named capture groups.
Model:
<166>Jul 01 2022 11:34:34 some-hostname : %ASA-foobar: Built outbound TCP connection 1234566 for mpls:192.168.1.1/80 (192.168.1.1/80) to inside:8.8.8.8/443 (8.8.8.8/443)
Up to the first ipv4 and port (labeled srcip/srcport), the following expression works fine:
(?<hostname>\S+?)(\s)?:(\s)?\%ASA.+?(\sfor|\sfrom).+?(:|=)?\b(?<srcip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})(\/|:)?((?<srcport>\d+))?.+?
https://regex101.com/r/ViuaY2/1
However, once I try to optionally capture ANY subsequent patterns, the matching stops working.
i.e. there may or may not be additional details in the log after the source ip/port information. If there is, I'd like to capture it.
Before addressing the more complex captures of the destination ip/port, I'm trying to sort this out by getting the expression to optionally capture the `to` pattern. This is not working:
(?<hostname>\S+?)(\s)?:(\s)?\%ASA.+?(\sfor|\sfrom).+?(:|=)?\b(?<srcip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})(\/|:)?((?<srcport>\d+))?.+?(to)?
https://regex101.com/r/9aLDhc/1
Leading up to the last part of the expression, I've tried all of the following:
.+?((to)?)?
.*(to)? # this and similar ones just greedily capture everything until end of string, without a capture for `to`
.*?((to)?)?
Thanks in advance!
CodePudding user response:
If everything is non-greedy and nothing after it is required to match, the regex engine can settle for matching as little as possible, so an optional `(to)?` at the end simply matches nothing.
You have to put the non-greedy dot and the literal ` to` together, as a whole, inside an optional non-capturing group: `(?:.*? to\b)?`
(?<hostname>\S+?)(\s)?:(\s)?\%ASA.+?(\sfor|\sfrom).+?(:|=)?\b(?<srcip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})(\/|:)?((?<srcport>\d+))?(?:.*? to\b)?
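
To see the difference side by side, here is a minimal sketch in Python. It is not part of the original answer and makes a few assumptions: Python's `re` module needs `(?P<name>...)` instead of `(?<name>...)`, the names `LOG`, `PREFIX`, `lazy_tail` and `grouped_tail` are made up for illustration, and a named group `to` is added inside the optional group only so the result is easy to inspect. The second test line is just the sample log cut off after the source ip/port.

```python
import re

# Sample line from the question.
LOG = ("<166>Jul 01 2022 11:34:34 some-hostname : %ASA-foobar: Built outbound TCP "
       "connection 1234566 for mpls:192.168.1.1/80 (192.168.1.1/80) "
       "to inside:8.8.8.8/443 (8.8.8.8/443)")

# Part of the expression up to srcport, rewritten with Python's (?P<name>...) syntax.
PREFIX = (r"(?P<hostname>\S+?)(\s)?:(\s)?\%ASA.+?(\sfor|\sfrom).+?(:|=)?\b"
          r"(?P<srcip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})(\/|:)?((?P<srcport>\d+))?")

# Question's attempt: the lazy dot stops after one character, then (to)? matches nothing.
lazy_tail = re.compile(PREFIX + r".+?(?P<to>to)?")

# Answer's approach: lazy dot and " to" succeed or fail together as one optional group.
grouped_tail = re.compile(PREFIX + r"(?:.*? (?P<to>to)\b)?")

broken = lazy_tail.search(LOG)
fixed = grouped_tail.search(LOG)

# Same line truncated right after the source ip/port, so there is no "to" part.
SHORT_LOG = LOG.split(" (")[0]
short = grouped_tail.search(SHORT_LOG)

print(broken.group("to"))   # group did not participate in the match
print(fixed.groupdict())    # hostname, srcip, srcport and to
print(short.groupdict())    # same fields, with to left unset
```

The same wrapping should carry over to the destination ip/port: put each further optional piece, together with the lazy dot that leads up to it, inside its own `(?: ... )?` group.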