How to multi-line regex match each distinct entry of a log file

I'm trying to match each distinct entry in a log file, even when an entry spans multiple lines. Every entry begins with a timestamp, no matter how many lines belong to it.

Here is my log file:

2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC This is a much longer 1 line sentence that manages to wrap itself around because of its length
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers12345

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:05 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines in between lines of text
           words words words and some numbers678910

a few more words
more words on another line and the next line might be blank

2000-01-01 01:01:07 UTC some random text on one line

Essentially, any line that does not begin with a timestamp should be matched as part of the entry above it.

This works well as a base, but it won't grab any entry that spans multiple lines:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}[.][0-9]+ UTC [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]] .+\n)

I've tried extending it with a negative lookahead to get each distinct entry as a match, like so, but it's not right and I get even fewer matches:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC .+\n)(.+\n)*(?!([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC))

Is there a way to construct a regex to grab each distinct entry?

CodePudding user response：

Your first example seems to take milliseconds into account, which I don't see in your logs.

You could do it with a positive lookahead:

^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC) (.*?)(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}|\z)

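It grabs the log text until it encounters another timestamp or the end of the input (\z), and captures the timestamp and the log entry separately. It needs the multiline flag, so that ^ matches at the start of every line, and the single-line (dotall) flag, so that . also matches newlines.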
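For illustration, here is a minimal Python sketch of the positive-lookahead approach. It is not the answerer's exact pattern: the lookahead is anchored with ^ and includes UTC, so a timestamp in the middle of a line does not end an entry early, and \Z stands in for \z because Python's re module spells the end-of-input anchor that way. The names LOG, TIMESTAMP and ENTRY are made up for the example, and the sample is shortened from the log in the question.

```python
import re

# Sample shortened from the log in the question.
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines
           words words words and some numbers12345

a few more words
2000-01-01 01:01:05 UTC some random text on 1 line
"""

# Hypothetical helper name; the pattern itself comes from the answer.
TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# MULTILINE: ^ matches at the start of every line.
# DOTALL: . also matches newlines, so (.*?) can span several lines.
# The lazy (.*?) stops at the next line that starts with a timestamp,
# or at the end of the input (\Z).
ENTRY = re.compile(
    rf"^({TIMESTAMP}) (.*?)(?=^{TIMESTAMP}|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Group 1 is the timestamp, group 2 the entry text (trailing newlines removed).
entries = [(m.group(1), m.group(2).rstrip()) for m in ENTRY.finditer(LOG)]

for stamp, text in entries:
    print(stamp, "|", repr(text))
```

Each tuple in entries holds one timestamp and the full text of its entry, including any blank lines inside the entry.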

Regex101

CodePudding user response：

In your first regex, I don't understand why you use [[][0-9]+[]]: [[][0-9]+[-][0-9]+[]] .+\n after UTC, or what [.][0-9]+ is meant to do.

However, this is how you could make it work with a negative lookahead:

^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*

With the multiline flag set, this matches only the lines that do not start with a timestamp followed by UTC, that is, the continuation lines of an entry.

See the result
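As a rough Python sketch, the negative lookahead can be used in two ways: on its own to list the continuation lines, or after a timestamp line to collect a whole entry in one match. The second pattern is an extension of the answer, not something it states, and the names LOG, TIMESTAMP, CONTINUATION and ENTRY are made up for the example. The first pattern uses .+ rather than .* so that blank lines are skipped, which is a choice made here.

```python
import re

# Sample shortened from the log in the question.
LOG = """\
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines
           words words words and some numbers678910

a few more words
2000-01-01 01:01:07 UTC some random text on one line
"""

# Hypothetical helper name; the pattern itself comes from the answer.
TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# Continuation lines only: non-empty lines that do NOT start with a timestamp.
CONTINUATION = re.compile(rf"^(?!{TIMESTAMP}).+$", re.MULTILINE)
continuations = CONTINUATION.findall(LOG)

# Whole entries: a timestamp line, then every following line that fails the
# same timestamp test. DOTALL is not set, so . stops at each newline and the
# lookahead is checked at the start of every candidate line.
ENTRY = re.compile(rf"^{TIMESTAMP}.*(?:\n(?!{TIMESTAMP}).*)*", re.MULTILINE)
entries = [m.group(0).rstrip() for m in ENTRY.finditer(LOG)]

for entry in entries:
    print(repr(entry))
```

Because the second pattern consumes whole lines and never relies on a lazy quantifier, it only tests for a timestamp at line starts.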