How do I match multiline expressions with junk in the middle?

I'm trying to match a multiline expression from some logs we have. The main difficulty is that, because of race conditions, we sometimes have to use a custom print function guarded by a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.

My solution was this monstrosity:

changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>

Explanation of the above regex:

changed key '(\w+)' value: - This is how we detect a print (and save the key name in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by an optional space (because the last byte doesn't have a space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)

This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
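
For reference, the breakdown above can be laid out as a commented pattern. This is only a sketch: it assumes Python's re module (the question does not name the engine), it writes literal spaces as "[ ]" and the closing bracket as "\]" because of verbose mode, and it checks only the two prefix-free Case 2 inputs. It illustrates one behaviour worth knowing: a repeated capture group keeps only the text of its last repetition, which is one plausible reason a value spread over several lines shows up as just its last line.

```python
import re

# The question's pattern laid out with re.VERBOSE. Unescaped whitespace is
# ignored in verbose mode, so literal spaces are written as "[ ]".
PATTERN = re.compile(r"""
    changed[ ]key[ ]'(\w+)'[ ]value:[ ]    # group 1: key name
    <
    (                                      # group 2: repeated once per chunk
      (                                    # group 3: a run of byte pairs
        ([0-9a-f]{2}[ ]*)+                 # group 4: hex pair, optional spaces
      )
      (?:\n)*                              # optional newlines
      (?:<\d+>[ ]\w+[ ](?:.*?\][ ]\[\d+\])\s*)*   # optional log prefixes
    )*
    >
""", re.VERBOSE)

single_line = ("changed key 'KEY_NAME' value: "
               "<ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>")
with_newlines = ("changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 \n\n"
                 "04 00 00 ff  \nff 00 00 00 11 00 \n\n00 00 00 00 21>")

m1 = PATTERN.search(single_line)
m2 = PATTERN.search(with_newlines)
print(m1.group(1), "|", m1.group(2))  # the whole value: group 2 repeated once
print(m2.group(1), "|", m2.group(2))  # KEY_NAME | 00 00 00 00 21
```

Both inputs match, but in the second one group 2 holds only the final chunk, so a single capture group placed inside a repetition cannot return the whole value on its own.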

Essentially, I'm trying to match this (two capture groups):

changed key '(\w+)' value: <({only hexadecimal pairs})>

group 1: key
group 2: value

Below are the dummy cases (same value in all cases):

// Case 1
<22213> Nov 30 00:00:00.287 [D1]  [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]
<22213> Nov 30 00:00:00.287 [D1]  [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]ff ff
<22213> Nov 30 00:00:00.287 [D1]  [128]00 00 00 11 00 00 00 00 00 21>

// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>

// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 

04 00 00 ff  
ff 00 00 00 11 00 

00 00 00 00 21>

The key isn't always the same key, so the value (and the value length) can change.

CodePudding user response:

This approach first strips the leading log prefix from each line, leaving behind only the content you want to target. After that, it runs re.findall with a regex pattern similar to the one you are already using.

import re

inp = """<22213> Nov 30 00:00:00.287 [D1]  [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]
<22213> Nov 30 00:00:00.287 [D1]  [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]ff ff
<22213> Nov 30 00:00:00.287 [D1]  [128]00 00 00 11 00 00 00 00 00 21>"""
# Strip the "<...> timestamp [..] [..]" prefix from the start of every line.
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
# With the prefixes gone, capture the key and everything between < and >.
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
# Collapse newlines and repeated spaces inside each value into single spaces.
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)

This prints:

[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
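
To see what the first step does in isolation, here is a small sketch that applies the same prefix pattern line by line. The sample lines are shortened versions of the logs in the question, and the names PREFIX, lines and stripped are made up for this illustration.

```python
import re

# Same prefix pattern as above: "<...>", the timestamp, then one or more "[...]" tags.
PREFIX = re.compile(r'^<.*?>.*?(?:\s+\[.*?\])+', flags=re.M)

# Shortened sample lines (illustrative data, not real logs).
lines = [
    "<22213> Nov 30 00:00:00.287 [D1]  [128]changed key 'KEY_NAME' value: <ab ab",
    "<22213> Nov 30 00:00:00.287 [D1]  [128]",
    "<22213> Nov 30 00:00:00.287 [D1]  [128]ff ff>",
    "changed key 'KEY_NAME' value: <ab ab ff ff>",  # printf-style line, no prefix
]

stripped = [PREFIX.sub('', line) for line in lines]
for s in stripped:
    print(repr(s))
```

Prefixed lines lose everything up to and including the last bracketed tag, a prefix-only line becomes an empty string, and a line without a prefix is left untouched because it does not start with <.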

Assuming there could be unwanted values in between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:

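# Reuses `re` and the original (unstripped) `inp` from the first snippet.
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
# Keep only the two-character lowercase hex tokens from each captured value.
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches)  # output same as above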
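
As a rough sketch of how the steps could be packaged, the helper below runs the same three patterns over all three sample cases from the question. The function name extract_changes and the constant names are hypothetical, and the hex pattern assumes lowercase digits, as in the samples.

```python
import re

PREFIX = re.compile(r'^<.*?>.*?(?:\s+\[.*?\])+', flags=re.M)
CHANGE = re.compile(r"changed key '(\w+)' value: <(.*?)>", flags=re.S)
HEX_PAIR = re.compile(r'\b[a-f0-9]{2}\b')  # lowercase hex only (assumption)


def extract_changes(log_text):
    """Return a list of (key, value) tuples found in log_text (hypothetical helper)."""
    cleaned = PREFIX.sub('', log_text)  # step 1: drop the per-line log prefix
    return [(key, ' '.join(HEX_PAIR.findall(raw)))  # step 3: keep hex pairs only
            for key, raw in CHANGE.findall(cleaned)]  # step 2: key + raw value


P = "<22213> Nov 30 00:00:00.287 [D1]  [128]"
case1 = "\n".join([
    P + "changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00",
    P,
    P + "00 04 00 00",
    P + "ff ff",
    P + "00 00 00 11 00 00 00 00 00 21>",
])
case2 = ("changed key 'KEY_NAME' value: "
         "<ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>")
case2_newlines = ("changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 \n\n"
                  "04 00 00 ff  \nff 00 00 00 11 00 \n\n00 00 00 00 21>")

EXPECTED = "ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21"
for text in (case1, case2, case2_newlines):
    print(extract_changes(text) == [("KEY_NAME", EXPECTED)])
```

The prefix still has to be stripped before the hex-pair step, since tokens such as the 30 and 00 in the timestamp would otherwise also look like byte values.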