For a log file, I'm trying to get a match for each distinct entry even if it spans multiple lines. Each distinct entry will begin with a timestamp even if there are multiple lines pertaining to the entry.
Here is my log file:
2000-01-01 01:01:01 UTC This is a 2 line sentence.
This is the second line
2000-01-01 01:01:02 UTC some random text on 1 line
2000-01-01 01:01:03 UTC This is a much longer 1 line sentence that manages to wrap itself around because of its length
2022-01-01 01:01:04 UTC This multi line paragraph has a few blank lines in between lines of text
words words words and some numbers12345
a few more words
more words on another line and the next line might be blank
2000-01-01 01:01:05 UTC some random text on 1 line
2000-01-01 06:01:06 UTC This multi line paragraph has a few blank lines in between lines of text
words words words and some numbers678910
a few more words
more words on another line and the next line might be blank
2000-01-01 01:01:07 UTC some random text on one line
I'm trying to match essentially any line that does not begin with a timestamp.
This works well as a base, but it won't grab any entry that spans multiple lines:
^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}[.][0-9] UTC [[][0-9] []]: [[][0-9] [-][0-9] []] . \n)
I've tried adding a negative lookahead to it to get each distinct entry as a match, like so, but it's not right and I get even fewer matches: ^([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC . \n)(. \n)*(?!([0-9]{4}[-][0-9]{2}[-][0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC))
Is there a way to construct a regex to grab each distinct entry?
CodePudding user response:
Your first example seems to take milliseconds into account, which I don't see in your logs.
You could do it with a positive lookahead:
^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC) (.*?)(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}|\z)
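It grabs the log text until it encounters another timestamp or the end of the input (\z), and captures the timestamp and the log entry separately. For this to work across lines, the regex needs multiline mode (so that ^ matches at the start of each line) and single-line/dotall mode (so that . also matches newlines).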
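As a rough illustration of the lookahead approach, here is a minimal Python sketch. It rests on a few assumptions that are not part of the answer above: Python's re module is used, \Z is written in place of \z, the timestamp inside the lookahead is anchored with ^ so that a timestamp in the middle of a message does not end an entry, and split_entries is a hypothetical helper name.

```python
import re

# Timestamp shape taken from the sample log, e.g. "2000-01-01 01:01:01 UTC".
TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
# \Z (end of string) stands in for \z here.
ENTRY = re.compile(
    r"^(" + TIMESTAMP + r") (.*?)(?=^" + TIMESTAMP + r"|\Z)",
    re.MULTILINE | re.DOTALL,
)


def split_entries(log_text):
    """Return (timestamp, message) pairs, one per entry (hypothetical helper)."""
    return [
        (m.group(1), m.group(2).rstrip("\n"))
        for m in ENTRY.finditer(log_text)
    ]


sample = (
    "2000-01-01 01:01:01 UTC This is a 2 line sentence.\n"
    "This is the second line\n"
    "2000-01-01 01:01:02 UTC some random text on 1 line\n"
)

for timestamp, message in split_entries(sample):
    print(timestamp, repr(message))
```

The lazy (.*?) keeps each entry as short as possible, so it stops at the first line that starts with a timestamp rather than swallowing the rest of the file.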
CodePudding user response:
From your first regex, I do not understand why you are using [[][0-9] []]: [[][0-9] [-][0-9] []] . \n after UTC, or what [.][0-9] is supposed to be good for.
However, this is how you could make it work with a negative lookahead:
^(?![0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC).*
With multiline mode enabled, this skips every line that starts with a timestamp (up to and including UTC) and matches only the remaining lines.
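A small Python sketch of the negative-lookahead pattern may help show what it actually matches. It assumes Python's re module with the MULTILINE flag; the name CONTINUATION and the sample text are made up for illustration.

```python
import re

# Timestamp shape taken from the sample log, e.g. "2000-01-01 01:01:01 UTC".
TIMESTAMP = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} UTC"

# MULTILINE makes ^ match at the start of every line, so the negative
# lookahead is checked once per line. Without DOTALL, .* stops at the
# end of the line, so each match is a single line.
CONTINUATION = re.compile(r"^(?!" + TIMESTAMP + r").*", re.MULTILINE)

sample = "\n".join([
    "2000-01-01 01:01:01 UTC This is a 2 line sentence.",
    "This is the second line",
    "2000-01-01 01:01:02 UTC some random text on 1 line",
])

print(CONTINUATION.findall(sample))
# ['This is the second line']
```

Note that this returns only the continuation lines, not whole entries, and that .* also matches empty lines; grouping those lines back under their timestamp would need extra logic.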