Parsing a log file and ignoring text between two targets

Time:12-17

This question is a follow-up to my previous question here: Parsing text and JSON from a log file and keeping them together

I have a log file, your_file.txt, with the following structure, and I would like to extract the timestamp, run, user, and JSON:

A whole bunch of irrelevant text
2022-12-15 12:45:06 garbage, run: 1, user: james json:
[{"value": 30, "error": 8}]

Another stack user was helpful enough to provide this abridged code to extract the relevant pieces:

import re

pat = re.compile(
    r'(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

with open('your_file.txt', 'r') as f_in:
    print(pat.findall(f_in.read()))

This returns the following value, which is then processed further:

[('2022-12-15 12:45:06 garbage', '1', 'james', '[{"value": 30, "error": 8}]')]

How can I amend the regex so that the word "garbage" after the timestamp is not included in the output of pat.findall?

CodePudding user response:

You can match the date and time with an explicit pattern first, and then consume the rest of the substring up to the comma outside the capturing group:

(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$

See the regex demo.

The ([^,\n]+) is replaced with (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*, which matches:

  • (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - Group 1, the timestamp: four digits, then two occurrences of - followed by two digits (the date), a space, two digits, and then two occurrences of : followed by two digits (the time)
  • [^,\n]* - zero or more chars other than a comma and newline
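
As a minimal sketch of how the amended pattern might be used, the snippet below runs it against an in-memory copy of the sample log from the question instead of your_file.txt. The variable names log_text, matches, and records are illustrative, and the final json.loads step is only an assumption about what "processed further" could mean.

```python
import json
import re

# In-memory copy of the sample log from the question (stands in for your_file.txt).
log_text = (
    "A whole bunch of irrelevant text\n"
    "2022-12-15 12:45:06 garbage, run: 1, user: james json:\n"
    '[{"value": 30, "error": 8}]\n'
)

# Group 1 captures only the timestamp; [^,\n]* then consumes any trailing
# text (such as " garbage") up to the comma without capturing it.
pat = re.compile(
    r'(?ms)^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[^,\n]*,'
    r'\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$'
)

matches = pat.findall(log_text)
print(matches)
# [('2022-12-15 12:45:06', '1', 'james', '[{"value": 30, "error": 8}]')]

# Assumed follow-up step: decode the JSON payload of each match.
records = [(ts, run, user, json.loads(payload)) for ts, run, user, payload in matches]
print(records[0][3][0]["value"])
# 30
```

Because the lazy (.*?)$ stops at the first end of line, this sketch only handles JSON payloads that sit on a single line, as in the sample.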