I'm trying to match a multiline expression in some of our logs. The biggest problem is that, because of race conditions, we sometimes have to use a custom print function with a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.
My solution was this monstrosity:
changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>
Explanation of the above regex:
changed key '(\w+)' value:
- This is how we detect a print (and save the key name in a capture group).
<{regex}>
- The value output starts with < and ends with >.
([0-9a-f]{2} *)
- The bytes are hexadecimal pairs followed by an optional space (because the last byte doesn't have a space). Let's call this capture group 4.
({group4}+)
- One or more of group 4.
(?:\n)*
- There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*
- There can be 0 or more prints of the timestamp. (non-capture)
This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
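One likely explanation for the Case 1 behaviour is that the value sits inside a capturing group that is itself repeated. In Python's re module, a repeated capturing group keeps only the text from its final iteration, so the overall match can span every line while the group holds just the last one. The snippet below is a minimal sketch using a stripped-down stand-in pattern and made-up two-line input, not the full regex or the real logs:

```python
import re

# Made-up input: one value split over two lines.
text = "<ab cd\nef 01>"

# Repeated capturing group: group 1 is overwritten on every iteration.
repeated = re.search(r"<(([0-9a-f]{2} *)+\n*)*>", text)
print(repr(repeated.group(0)))  # the overall match still spans both lines
print(repr(repeated.group(1)))  # only the final iteration is kept

# Capturing once around the whole repetition keeps everything.
whole = re.search(r"<((?:[0-9a-f]{2}\s*)+)>", text)
print(repr(whole.group(1)))
```

If this is indeed the cause, the fix is to capture once around the whole repetition (or capture everything between < and > and clean it up afterwards) rather than capturing inside the loop.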
Essentially, I'm trying to match this (two capture groups):
changed key '(\w+)' value: <({only hexadecimal pairs})>
group 1: key
group 2: value
Below are the dummy cases (same value in all cases):
// Case 1
<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>
// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>
// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>
The key isn't always the same key, so the value (and the value length) can change.
Answer:
This approach first strips the leading log prefix from each line, leaving only the content you want to target. It then runs an re.findall search using a regex pattern similar to the one you are already using.
import re

inp = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""
# Strip the "<id> timestamp [D1] [128]" prefix from the start of every line.
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
# Capture the key and everything between < and >, newlines included (re.S).
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
# Collapse each run of whitespace (including newlines) into a single space.
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)
This prints:
[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
Assuming there could be unwanted values between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches) # output same as above
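The answer only demonstrates Case 1, so here is a minimal sketch that wraps the same steps in a function and runs it on all three dummy cases from the question. The function name extract_changes and the constant names are hypothetical; the patterns are the ones used above, and the inputs assume the log lines carry no trailing whitespace.

```python
import re

# Same three patterns as in the answer, precompiled.
PREFIX = re.compile(r'^<.*?>.*?(?:\s+\[.*?\])+', flags=re.M)
ENTRY = re.compile(r"changed key '(\w+)' value: <(.*?)>", flags=re.S)
HEX_PAIR = re.compile(r'\b[0-9a-f]{2}\b')

def extract_changes(text):
    """Return a list of (key, 'xx xx ...') tuples found in the log text."""
    stripped = PREFIX.sub('', text)  # drop the per-line log prefix, if any
    return [(key, ' '.join(HEX_PAIR.findall(body)))
            for key, body in ENTRY.findall(stripped)]

case1 = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""

case2 = "changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>"

case2_newlines = """changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>"""

results = [extract_changes(c) for c in (case1, case2, case2_newlines)]
for r in results:
    print(r)
```

Lines without a log prefix are left alone by the first substitution because the pattern is anchored to a leading < at the start of a line, which is why the same function handles both kinds of log.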