Parse JSON data mixed with text in Python

I have a report that contains json objects mixed with plain text, and looks like this:

Payload
0x0000: some text
text
{
   "text1": {
      "text2": {
         "text3": "value3",
         "text4": "value4",
         "text5": "value5"
      },
      "text6": "value6",
      "text7": "value7"
   },
   "text8": "value8"
}
Payload 2
0x0001: some other text
other text
{
   "text1": {
      "text2": {
         "text3": "value3",
         "text4": "value4",
         "text5": "value5"
      },
      "text6": "value6",
      "text7": "value7"
   },
   "text8": "value8"
}

What I want to do is read the file, extract these JSON objects, and obtain specific values from each of them. The problem is that the report is huge, and the plain text between the JSON objects is different every time. I've tried json.loads(...), which fails, and json.dumps(...), which does not ignore the plain-text lines.

filedes = open(path, "r")
# Reading the whole file
text = filedes.read()
text = json.dumps(text)

Any ideas on how to parse this without manually removing the plain-text lines?

CodePudding user response：

You could loop through the file line by line and write to a separate file, starting when a line is { and stopping after a line that is }.

CodePudding user response：

You need to parse the whole file and track the boundaries of the objects.

import json

values = []

with open('data.txt') as f:
  in_object = False
  data = None
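  for line in f:
    # Strip only trailing whitespace so that an indented, nested "}" is not
    # mistaken for the end of the top-level object.
    line = line.rstrip()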
    if line == "}":
      if data is not None:
        data.append(line)
        content = '\n'.join(data)
        values.append(json.loads(content))
      in_object = False
    elif line == "{":
      in_object = True
      data = [line]
    else:
      if in_object and data is not None:
        data.append(line)

print(f'Found {len(values)} values')
for v in values:
  print(v)
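
As an alternative that does not depend on the braces sitting alone on their own lines, here is a minimal sketch built on json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and returns the value together with the index where it ended. The helper name iter_json_objects and the sample string are made up for illustration, and the sketch assumes the plain text between objects rarely contains a { of its own.

```python
import json

def iter_json_objects(text):
    """Yield each JSON object found in text, skipping the text around it."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            # raw_decode returns the parsed value and the index just past it.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # This "{" did not start valid JSON; move past it and keep looking.
            pos = start + 1
            continue
        yield obj
        pos = end

# Hypothetical sample, shaped like the report in the question.
sample = """Payload
0x0000: some text
{
   "text1": {"text2": {"text3": "value3"}},
   "text8": "value8"
}
Payload 2
other text
{"text8": "other"}
"""

objects = list(iter_json_objects(sample))
print(len(objects))
print(objects[0]["text1"]["text2"]["text3"])
```

For a very large report, reading the whole file into one string may use too much memory; in that case the line-by-line approach above is probably the safer choice.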

CodePudding user response：

Assuming the text between objects never starts with one of the characters below, you can ignore, while reading, every line that does NOT start with one of them (I'm assuming the JSON can also contain square brackets):

[
]
{
}
(empty space)

Example with regex: match ^[^\{\}\[\] ].*$ and replace with nothing.

^ matches beginning of line
[ opens a character class
^ at the beginning of a character class negates it, so the class matches any character that is NOT listed in it
\{\}\[\] (note the trailing space), where every bracket is escaped with \, lists the literals { } [ ] and the space character. Since the character class is negated, it matches any character that is NOT one of those.
] closes the character class
.* matches everything after
$ matches to the end of the line

You may need to escape the backslashes themselves, depending on the regex engine and on how you pass the pattern to it.
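
As a quick check of the pattern above, here is a small sketch in Python. A raw string (r"...") avoids the extra backslash escaping. The sample text is made up, and the sketch assumes there are no blank lines, since a negated character class also matches a newline and would swallow the line after a blank one.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
regex_invalid_lines = r"^[^\{\}\[\] ].*$"

# Hypothetical sample: two small objects separated by plain-text lines.
sample = 'Payload\n0x0000: some text\n{\n   "a": 1\n}\nPayload 2\n{\n   "a": 2\n}'

# MULTILINE makes ^ and $ match at the start and end of every line.
cleaned = re.sub(regex_invalid_lines, "", sample, flags=re.MULTILINE)

# The removed lines are left behind as empty lines; drop them for display.
kept = [line for line in cleaned.split("\n") if line]
print(kept)
```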

However, bear in mind that a string like the one below is still NOT valid JSON.

{
   "text1": { ... },
   "text8": "value8"
}
{
   "text1": { ... },
   "text8": "value8"
}

You'll still need to either split the text so that each object is parsed on its own, or (as below) wrap all of them in an array and add commas between the objects.

[{
   "text1": { ... },
   "text8": "value8"
},
{
   "text1": { ... },
   "text8": "value8"
}]

So, in the first regex, you can replace the lines to be ignored with commas instead of nothing, then remove duplicate commas (replace ^,(\r?\n,)+$ with ,). At the end, trim leading and trailing commas and add a leading [ and a trailing ].

^, matches lines starting with comma
( opens a group
\r?\n, matches newline (matches \r\n with \r being optional) followed by comma
) closes the group; because of the + that follows it, the whole group must be matched 1 or more times (so, after the first comma, newline comma can repeat indefinitely)
$ matches the end of the line (the pattern is applied in multiline mode)

An example in code (generated with help from regex101.com then adapted):

import re

regex_invalid_lines = r"^[^\{\}\[\] ].*$"
regex_duplicate_commas = r"^,(\r?\n,)+$"

test_str = ("Payload\n"
    "0x0000: some text\n"
    "text\n"
    "{\n"
    "   \"text1\": {\n"
    "      \"text2\": {\n"
    "         \"text3\": \"value3\",\n"
    "         \"text4\": \"value4\",\n"
    "         \"text5\": \"value5\"\n"
    "      },\n"
    "      \"text6\": \"value6\",\n"
    "      \"text7\": \"value7\"\n"
    "   },\n"
    "   \"text8\": \"value8\"\n"
    "}\n"
    "Payload 2\n"
    "0x0001: some other text\n"
    "other text\n"
    "{\n"
    "   \"text1\": {\n"
    "      \"text2\": {\n"
    "         \"text3\": \"value3\",\n"
    "         \"text4\": \"value4\",\n"
    "         \"text5\": \"value5\"\n"
    "      },\n"
    "      \"text6\": \"value6\",\n"
    "      \"text7\": \"value7\"\n"
    "   },\n"
    "   \"text8\": \"value8\"\n"
    "}")


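# pass flags by keyword; positional count/flags arguments are deprecated in recent Python versions
partial_result = re.sub(regex_invalid_lines, ",", test_str, flags=re.MULTILINE)
partial_result = re.sub(regex_duplicate_commas, ",", partial_result, flags=re.MULTILINE)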

# without multiline to match literal beginning/end of string
# ^,|,$ matches both leading comma and trailing comma
partial_result = re.sub("^,|,$", "", partial_result)

result = f"[{partial_result}]"

if result:
    print(result)
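
To get from that array string to the specific values the question asks for, the same steps can be wrapped in a function and handed to json.loads. This is only a sketch: the function name extract_objects and the sample report are made up for illustration, and it assumes that no plain-text line starts with a space or a bracket and that the report has no blank lines.

```python
import json
import re

def extract_objects(report):
    """Return the JSON objects in report as a list of dicts."""
    # Replace every non-JSON line with a comma.
    text = re.sub(r"^[^\{\}\[\] ].*$", ",", report, flags=re.MULTILINE)
    # Collapse runs of comma-only lines into a single comma.
    text = re.sub(r"^,(\r?\n,)+$", ",", text, flags=re.MULTILINE)
    # Drop a leading or trailing comma, then wrap everything in an array.
    text = re.sub(r"^,|,$", "", text.strip())
    return json.loads(f"[{text}]")

# Hypothetical report, shaped like the one in the question.
report = (
    "Payload\n"
    "0x0000: some text\n"
    "text\n"
    "{\n"
    '   "text1": {\n'
    '      "text6": "value6"\n'
    "   },\n"
    '   "text8": "value8"\n'
    "}\n"
    "Payload 2\n"
    "0x0001: some other text\n"
    "{\n"
    '   "text8": "other"\n'
    "}\n"
)

objects = extract_objects(report)
print(objects[0]["text1"]["text6"])
print([obj["text8"] for obj in objects])
```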