Parsing text and JSON from a log file and keeping them together

I have a .log file containing both text strings and json. For example:

A whole bunch of irrelevant text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[{"value": "15", "error": "3"}]
More irrelevant text

I want to extract the JSON but keep it paired with the text that precedes it, so that the two can be tied together. The keyword that marks the start of a new section is run. As the example above shows, I also need the timestamp from the same line where run appears. The character that marks the end of a section is ].

My goal is to parse this text into a pandas dataframe like the following:

timestamp              run   user     value    error
2022-12-15 12:45:06    1     james    30       8
2022-12-15 12:47:36    2     kelly    15       3

CodePudding user response：

Try:

import json
import re

import pandas as pd

# Groups: timestamp, run, user, and the JSON line that follows the header
pat = re.compile(
    r"(?ms)^([^,\n]+),\s*run:\s*(\S+),\s*user:\s*(.*?)\s*json:\n(.*?)$"
)

all_data = []
with open("your_file.txt", "r") as f_in:
    for timestamp, run, user, json_line in pat.findall(f_in.read()):
        json_line = json.loads(json_line)
        all_data.append(
            {
                "timestamp": timestamp,
                "run": run,
                "user": user,
                "value": json_line[0]["value"],
                "error": json_line[0]["error"],
            }
        )

df = pd.DataFrame(all_data)
print(df)

Prints:

             timestamp run   user  value  error
0  2022-12-15 12:45:06   1  james     30      8
1  2022-12-15 12:47:36   2  kelly     15      3
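
One thing to watch: in the sample log the JSON values are quoted, and the regex groups are plain text, so every column in the resulting frame holds strings. Below is a minimal sketch of converting them afterwards. The frame is rebuilt by hand here only so the snippet runs on its own, and the timestamp format is an assumption based on the two sample lines.

```python
import pandas as pd

# Hand-built stand-in for the parsed frame: every field is still a string
df = pd.DataFrame(
    [
        {"timestamp": "2022-12-15 12:45:06", "run": "1", "user": "james", "value": "30", "error": "8"},
        {"timestamp": "2022-12-15 12:47:36", "run": "2", "user": "kelly", "value": "15", "error": "3"},
    ]
)

# Assumed timestamp layout, taken from the sample log lines
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Turn the numeric-looking string columns into real numbers
for col in ["run", "value", "error"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

After this, sorting by timestamp or summing value behaves numerically rather than lexically.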

CodePudding user response：

To extract the JSON data together with the timestamp, you can use a regular expression that matches the start of each section. Here, that is the timestamp followed by the keyword "run". In Python this would look like:

import re

import pandas as pd

with open("file.log", "r") as f:
    text = f.read()

# One pattern captures the header line and the quoted values on the JSON line after it
pattern = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), run: (\d+), user: (\w+) json:\s*'
    r'\[\{"value": "(\d+)", "error": "(\d+)"\}\]'
)

data = []

for timestamp, run, user, value, error in pattern.findall(text):
    data.append((timestamp, int(run), user, int(value), int(error)))

# Tuples => DataFrame
df = pd.DataFrame(data, columns=["timestamp", "run", "user", "value", "error"])
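
If the JSON list can span several lines or hold more than one object, a line-by-line pass may be easier to maintain than a single regex. The following is only a sketch under assumptions drawn from the sample in the question: a section starts on a line matching the header pattern and ends on the first line ending in ] that makes the buffered text valid JSON. The names HEADER and parse_log are made up for illustration.

```python
import json
import re

# Assumed header layout, based on the sample log lines in the question
HEADER = re.compile(
    r"^(?P<timestamp>[^,]+),\s*run:\s*(?P<run>\d+),\s*user:\s*(?P<user>\S+)\s+json:\s*$"
)


def parse_log(lines):
    """Pair each header line with the JSON records that follow it."""
    rows = []
    header = None  # fields of the section currently being read
    buffer = []    # JSON lines collected for that section
    for line in lines:
        match = HEADER.match(line)
        if match:
            header = match.groupdict()
            buffer = []
            continue
        if header is None:
            continue  # irrelevant text outside any section
        buffer.append(line)
        if line.rstrip().endswith("]"):
            try:
                records = json.loads("".join(buffer))
            except json.JSONDecodeError:
                continue  # only a nested list closed; keep reading
            rows.extend({**header, **record} for record in records)
            header = None
    return rows


sample = """A whole bunch of irrelevant text
2022-12-15 12:45:06, run: 1, user: james json:
[{"value": "30", "error": "8"}]
2022-12-15 12:47:36, run: 2, user: kelly json:
[
  {"value": "15", "error": "3"}
]
More irrelevant text
"""

rows = parse_log(sample.splitlines(keepends=True))
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows). For a real file, pass the open file object instead of the split sample string.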