This question is a follow-up to my previous question here: Parsing text and JSON from a log file and keeping them together
I have a log file, your_file.txt, with the following structure, and I would like to extract the timestamp, run, user, and json:
A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]
Another stack user was helpful enough to provide this abridged code to extract the relevant pieces:
import re
# Groups: 1 = text before the first comma, 2 = run, 3 = user, 4 = JSON line
pat = re.compile(
    r'(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)
with open('your_file.txt', 'r') as f_in:
    print(pat.findall(f_in.read()))
This returns the following value, which is then processed further:
[('2022-12-15 12:45:06 garbage', '1', 'james', '[{"value": 30, "error": 8}]')]
How can I amend the regex so that the word "garbage" after the timestamp is ignored and not included in the output of pat.findall?
CodePudding user response:
You can match the date and time with a dedicated pattern first, and then consume the rest of the substring before the comma without capturing it:
(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$
See the regex demo.
The ([^,\n]+) part is replaced with (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*, which matches:
- (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - Group 1: four digits, then two occurrences of - followed by two digits, a space, two digits, and then two occurrences of : followed by two digits
- [^,\n]* - zero or more chars other than a comma and a newline
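As a quick check, here is a minimal sketch of the amended pattern run against an in-memory sample instead of your_file.txt. The `sample` string and the `records` list are assumptions made for illustration; they mirror the log structure from the question and are not part of the original code.

```python
import json
import re

# Hypothetical sample mirroring the log structure shown in the question.
sample = (
    "A whole bunch of irrelevant text\n"
    "2022-12-15 12:45:06 garbage, run: 1, user: james json:\n"
    '[{"value": 30, "error": 8}]\n'
)

# Group 1 captures only the timestamp; [^,\n]* then consumes everything
# up to the first comma (here " garbage") without capturing it.
pat = re.compile(
    r'(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,'
    r'\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

matches = pat.findall(sample)
print(matches)
# [('2022-12-15 12:45:06', '1', 'james', '[{"value": 30, "error": 8}]')]

# Optional further processing: decode the JSON payload of each match.
records = [
    {"timestamp": ts, "run": run, "user": user, "data": json.loads(payload)}
    for ts, run, user, payload in matches
]
```

Note that the last group stops at the end of the first line after json:, so this sketch assumes each JSON payload sits on a single line, as in the example above.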