Parsing a log file and ignoring text between two targets

This question is a follow-up to my previous question here: Parsing text and JSON from a log file and keeping them together

I have a log file, your_file.txt, with the following structure, and I would like to extract the timestamp, run, user, and JSON:

A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]

Another stack user was helpful enough to provide this abridged code to extract the relevant pieces:

import re

pat = re.compile(
    r'(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

with open('your_file.txt', 'r') as f_in:
    print(pat.findall(f_in.read()))

This returns the following value, which is then processed further:

[('2022-12-15 12:45:06 garbage', '1', 'james', '[{"value": 30, "error": 8}]')]

How can I amend the regex so that the word "garbage" after the timestamp is ignored and not included in the output of pat.findall?

CodePudding user response:

You can match the date and time with a dedicated pattern first, and then consume the rest of the substring up to the comma:

(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$

See the regex demo.

The ([^,\n]+) is replaced with (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*, which matches:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - Group 1: four digits, two occurrences of - and then two digits, a space, two digits, and then two occurrences of : and then two digits
[^,\n]* - zero or more chars other than a comma and newline
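
To check the pattern against the sample from the question, here is a minimal sketch. It assumes the log text is held in a string (the name sample is only for illustration; in practice it would be the result of f_in.read()), and it writes the run group as (\S+). Splitting the pattern across two raw strings is just for readability.

```python
import re

# Sample text copied from the question; a stand-in for f_in.read().
sample = (
    "A whole bunch of irrelevant text\n"
    "2022-12-15 12:45:06 garbage, run: 1, user: james json:\n"
    '[{"value": 30, "error": 8}]\n'
)

pat = re.compile(
    # Group 1 captures only the timestamp; [^,\n]* then consumes
    # whatever follows it (e.g. " garbage") up to the first comma.
    r'(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,'
    # Groups 2-4: run, user, and the JSON line after "json:".
    r'\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

matches = pat.findall(sample)
print(matches)
# [('2022-12-15 12:45:06', '1', 'james', '[{"value": 30, "error": 8}]')]
```

The word "garbage" is still matched, but because [^,\n]* sits outside any capturing group it no longer appears in the tuples returned by findall.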