Regex to match / output yml documents in logfile containing specific string

Time:10-07

I have a logfile I'm tailing and want to output only those yaml documents (separated by ---) containing a specific string (specific domain in hostname).

Example logfile contents:

(focus on the hostname)

---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.different.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: 1.2.3.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"

expected output:

---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: 1.2.3.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"

I cannot get my head around the regex I need. To match every document (regardless of what's inside), I'm doing this:

/---\n[\s\S]+?(?=\n---|$)/g

see also: https://regex101.com/r/a8zKSz/2

However, I cannot figure out how to output only those documents whose hostname matches the domain domain.com (the regex for the inner match could be e.g. /hostname: .*?domain\.com/).

I'd like to end up with a sed / perl or any other "oneliner" usable on a "default linux OS": tail -F logfile.log | oneliner. But getting the regex is the first step.

Any hints or help is appreciated.

CodePudding user response:

First of all, I have to say that regexes are not the proper tool for this. If your input is YAML, then use a tool made specifically for YAML.

For example, using yq, this can be done very easily:

cat example | yq eval 'select(.hostname | test(".domain.com"))' -

Equivalently, for JSON inputs, there is jq.

Regex solution

Still, this is an interesting challenge, and there might be cases where regexes are the most appropriate tool for the job. Here is a version that works.

Below, I wrote the pattern with added spacing and split it over several lines to make it easier to read.

---\n
( (?!---|hostname:) [^\n]+? (\n|$) )*
hostname:[^\n]+\.domain\.com (\n|$)
( (?!---|hostname:) [^\n]+? (\n|$) )*
(?=---|$)

The principle here is to write the pattern as an explicit state machine. A regex always describes a state machine, but we tend not to think about it; here, we want to make it very obvious.

  1. In the initial state, we look for a "yaml document start" marker (that is, ---\n). When we find such a line, we move to state #2.

  2. In state #2, we capture input lines (exactly one line at a time). However, we refuse to capture a line that starts with 'hostname:' (which forces a transition to state #3) or a line that starts with --- (which forces the engine to backtrack to step #1).

  3. In state #3, we capture a single line starting with hostname:, but only if the rest of the line matches the expected domain. If such a line is captured, we jump to state #4. If we can't match the line, the engine can't continue (because of the negative lookahead in step #2) and will therefore backtrack to step #1.

  4. In state #4, we continue capturing input lines until we reach the end of that document (that is, until we reach the next line matching '---\n' or the end of the input).
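To try the four-part pattern outside regex101, here is a minimal Python sketch using re.VERBOSE, which allows the same spaced-out layout. It assumes the quantifiers in the pattern are + and that the dots in the domain are meant literally (so they are escaped here); the names DOC_WITH_DOMAIN, matching_documents and sample are made up for illustration.

```python
import re

# One line of the pattern per state, as in the explanation above.
# Assumptions: '+' quantifiers, escaped dots, \Z for "end of input".
DOC_WITH_DOMAIN = re.compile(r"""
    ---\n                                        # state 1: document start marker
    (?: (?!---|hostname:) [^\n]+ (?:\n|\Z) )*    # state 2: lines before the hostname
    hostname: [^\n]+ \.domain\.com (?:\n|\Z)     # state 3: hostname line with the domain
    (?: (?!---|hostname:) [^\n]+ (?:\n|\Z) )*    # state 4: remaining lines of the document
    (?= --- | \Z )                               # stop before the next document or at the end
""", re.VERBOSE)


def matching_documents(text):
    """Return every '---' document in text whose hostname ends in .domain.com."""
    return [m.group(0) for m in DOC_WITH_DOMAIN.finditer(text)]


# Shortened stand-in for the logfile from the question.
sample = (
    "---\nevent: a\nhostname: a.b.c.domain.com\nbody: x\n"
    "---\nevent: b\nhostname: a.b.c.different.com\nbody: y\n"
    "---\nevent: c\nhostname: 1.2.3.domain.com\nbody: z\n"
)

for doc in matching_documents(sample):
    print(doc, end="")
```

Note that because every line must contain at least one character ([^\n]+), a document with an empty line in it (for example inside a multiline body) will not match with this pattern.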

Perl solution

Given that neither the yq solution nor the regex solution is viable in your situation, here is yet another approach, this time using perl (no external module required).

Once again, I format the code so that it is easier to understand, but this can easily be reduced to a single line.

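perl -ne '
    # A line starting with --- begins a new document: flush the previous one.
    if ($_ =~ /^---/) {
        if ($doc =~ /hostname: [^\n]*\.domain\.com/) {
            print ($doc);
        }
        $doc = $_;
    } else {
        $doc .= $_;
    }
    # At end of input, flush the last buffered document too.
    END { print $doc if $doc =~ /hostname: [^\n]*\.domain\.com/ }'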
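For comparison, the same buffering idea can be sketched in Python: collect lines until the next --- line, then print the buffered document if it contains a matching hostname line. This is only an illustration under the same assumptions as the Perl version; HOSTNAME and filter_documents are hypothetical names, and the pattern additionally requires the domain to sit at the end of the hostname line.

```python
import re

# Matches a hostname line ending in .domain.com anywhere in a document.
HOSTNAME = re.compile(r"^hostname: .*\.domain\.com$", re.MULTILINE)


def filter_documents(lines, pattern=HOSTNAME):
    """Yield each '---'-separated document whose text matches pattern."""
    doc = []
    for line in lines:
        # A separator line closes the document buffered so far.
        if line.startswith("---") and doc:
            text = "".join(doc)
            if pattern.search(text):
                yield text
            doc = []
        doc.append(line)
    # End of input: flush the last buffered document as well.
    if doc:
        text = "".join(doc)
        if pattern.search(text):
            yield text


# Small demo on an in-memory list of lines.
demo_lines = [
    "---\n", "hostname: a.b.c.domain.com\n", "body: x\n",
    "---\n", "hostname: a.b.c.different.com\n",
    "---\n", "hostname: 1.2.3.domain.com\n",
]
for document in filter_documents(demo_lines):
    print(document, end="")
```

Any iterable of lines works, so passing sys.stdin instead of demo_lines would let the function sit behind tail -F logfile.log. As with the Perl version, a document is only emitted once the next --- line (or the end of the input) arrives.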

CodePudding user response:

A solution without regex, in Python. Assume your text is in test.log:

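# Read the whole log and split it into documents on the '---' separator lines.
with open('test.log') as f:
    contents = f.read().split('---\n')
for content in contents:
    # Keep a document only if it has a hostname line ending in the wanted domain.
    if any(line.startswith('hostname:') and line.rstrip().endswith('.domain.com')
           for line in content.splitlines()):
        print('---\n' + content, end='')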