How to match each distinct entry of a log file with a multi-line regex

Time:08-15

For a log file, I'm trying to get a match for each distinct entry even if it spans multiple lines. Each distinct entry will begin with a timestamp even if there are multiple lines pertaining to the entry.

Here is my log file:

2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC This is a much longer 1 line sentence that manages to wrap itself around because of its length
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers12345

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:05 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers678910

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:07 UTC some random text on one line

I'm trying to match essentially any line that does not begin with a timestamp.

This works well as a base, but it won't grab any entry that spans multiple lines:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}[.][0-9]* UTC [[][0-9]*[]]: [[][0-9]*[-][0-9]*[]] .*\n)

I've tried adding a negative lookahead to get each distinct entry as a match, like so, but it's not right and I get even fewer matches:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC .*\n)(.*\n)*(?!([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC))

Is there a way to construct a regex to grab each distinct entry?

CodePudding user response:

Your first example seems to take milliseconds into account, which I don't see in your logs.

You could do it with a positive lookahead:

^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC) (.*?)(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}|\z)

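It grabs the log text until it encounters another timestamp or the end of the input (\z), and captures the timestamp and the log entry separately. For this to work, the multiline flag must be on (so that ^ matches at the start of every line) and the dot-all/single-line flag must be on (so that . also matches newlines). Note that \z is not available in every regex flavor; Python spells it \Z, for example.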
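As a rough illustration of the lookahead approach, here is a minimal Python sketch. It is an adaptation, not the exact pattern above: it assumes Python's re module (hence \Z instead of \z), and it anchors the lookahead with ^ and " UTC" so that a timestamp appearing in the middle of a message does not end the entry early. The names LOG, TS, ENTRY and entries are made up for the example, and the sample text is a shortened copy of the log from the question.

```python
import re

# Shortened sample of the log from the question (assumed input).
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines
           words words words and some numbers12345

a few more words

2000-01-01 01:01:05 UTC some random text on 1 line
"""

# Timestamp sub-pattern, reused in the capture group and in the lookahead.
TS = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# MULTILINE: ^ matches at the start of every line.
# DOTALL:    . also matches newlines, so one entry can span several lines.
# The lazy (.*?) stops at the next line that starts with a timestamp,
# or at the very end of the input (\Z in Python).
ENTRY = re.compile(
    rf"^({TS} UTC) (.*?)(?=^{TS} UTC|\Z)",
    re.MULTILINE | re.DOTALL,
)

# One (timestamp, text) pair per log entry; trailing newlines are trimmed.
entries = [(m.group(1), m.group(2).rstrip()) for m in ENTRY.finditer(LOG)]

for ts, text in entries:
    print(ts, "->", repr(text))
```

Each element of entries is one log entry, with inner blank lines preserved. Other regex engines name the flags differently (for instance /ms or /gms), so the flag setup would need adjusting there.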


CodePudding user response:

In your first regex, I don't understand why you use [[][0-9]*[]]: [[][0-9]*[-][0-9]*[]] .*\n after UTC, or what [.][0-9]* is for.

However, this is how you could make it work with a negative lookahead:

^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*

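With the multiline flag enabled, this matches only the lines that do not start with a timestamp followed by UTC, i.e. the continuation lines of an entry. Lines that do start with a timestamp are skipped.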
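To make the behavior concrete, here is a small Python sketch of the negative-lookahead pattern. It assumes Python's re module with the MULTILINE flag; the names LOG, CONTINUATION, lines and non_blank are invented for the example, and the sample text is a shortened copy of the log from the question. Because .* can match zero characters, blank lines also produce empty matches, so the sketch filters those out.

```python
import re

# Shortened sample of the log from the question (assumed input).
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
more words on another line and the next line might be blank

2000-01-01 01:01:05 UTC some random text on 1 line
"""

# MULTILINE makes ^ match at the start of every line. The negative
# lookahead rejects any line that begins with "<timestamp> UTC".
CONTINUATION = re.compile(
    r"^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*",
    re.MULTILINE,
)

# All lines without a leading timestamp, including empty matches
# produced by blank lines.
lines = [m.group(0) for m in CONTINUATION.finditer(LOG)]

# Keep only the lines that actually contain text.
non_blank = [line for line in lines if line.strip()]

for line in non_blank:
    print(repr(line))
```

This only finds the continuation lines; it does not group them with the timestamped line they belong to, so it answers the "any line that does not begin with a timestamp" part of the question rather than the "one match per entry" part.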

