How to extract JSON strings from a text file?

I need to post-process a text file that contains normal text and JSON strings, like the example below:

Result.txt:

Name: Sham
good student
('{"result": {"School name": "abc", roll no": 2, "Class": "7B", "Marks": '
 '310.55, "class_average_percentage": 81.523, "percentage": 91'}}')

Name: Ram
need to improve
('{"result": {"School name": "abc", "roll no": 3, "Class": "8A", "Marks": '
 '250.89, "class_average_percentage": 86.23, "percentage": 68'}}')

Name: Roy
average student
('{"result": {"School name": "abc", "roll no": 9, "Class": "7B", "Marks": '
 '298.45, "class_average_percentage": 81.523, "percentage": 73'}}')

How can I parse the file to assign the JSON strings to the corresponding names?

Example:

Sham = ('{"School name": "abc", "result": {"roll no": 2, "Class": "7B", "Marks": '
        '310.55, "class_average_percentage": 81.523, "percentage": 91'}}')

CodePudding user response：

The JSON in the text file is malformed. Each of the supposed JSON strings ends with a stray single quote before the closing braces, as in … 91'}}' and … 68'}}'. The endings should look like this: … 91}}'. The first record is also missing the opening double quote before roll no.

Here's the corrected file:

Name: Sham
good student
('{"result": {"School name": "abc", "roll no": 2, "Class": "7B", "Marks": '
 '310.55, "class_average_percentage": 81.523, "percentage": 91}}')

Name: Ram
need to improve
('{"result": {"School name": "abc", "roll no": 3, "Class": "8A", "Marks": '
 '250.89, "class_average_percentage": 86.23, "percentage": 68}}')

Name: Roy
average student
('{"result": {"School name": "abc", "roll no": 9, "Class": "7B", "Marks": '
 '298.45, "class_average_percentage": 81.523, "percentage": 73}}')

I was able to parse the fixed version of your Result.txt shown above with an FSM (finite-state machine), which is well suited to parsing patterns like the ones in this file. In a nutshell, an FSM is always in one of a finite number of states and transitions from one state to another based on the input it is fed iteratively (such as the lines of a text file).

To handle the JSON data, the code gathers the lines that make up the JSON string, which appears to be in Python string-literal format, joins them, and applies ast.literal_eval() to obtain the string value. It then uses json.loads() to convert that string into a Python dictionary, converts the nested dictionary under the "result" key back into a JSON string with json.dumps(), and adds that string to the student_info dictionary being built.
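The conversion chain just described can be tried on a single record in isolation. This is only a minimal sketch: the two lines are copied from the corrected file, and the names raw_lines, record and inner_json are made up for the illustration.

```python
from ast import literal_eval
import json

# The two (stripped) lines of one record from the corrected file.
raw_lines = [
    """('{"result": {"School name": "abc", "roll no": 2, "Class": "7B", "Marks": '""",
    """'310.55, "class_average_percentage": 81.523, "percentage": 91}}')""",
]

# Joined, the lines form two adjacent Python string literals inside
# parentheses; literal_eval() evaluates that to one concatenated str.
json_str = literal_eval(''.join(raw_lines))

record = json.loads(json_str)               # JSON text -> dict
inner_json = json.dumps(record['result'])   # nested dict -> JSON text

print(inner_json)
```

The printed string should match the part between the quotes in the "Sham" line of the printed results.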

Although the student's name could be used to create a variable and assign the JSON string to it, that is generally considered poor practice, so the code instead creates a dictionary named student_info to hold the results. Its keys are the students' names, and the value associated with each is the JSON string for that student's result, wrapped in parentheses and quotes as in the requested example.

from ast import literal_eval
import json


filepath = './Result.txt'
student_info = {}

with open(filepath) as file:
    # States: 0 = looking for a name, 1 = looking for the first JSON line,
    #         2 = collecting the remaining JSON lines.
    state = 0
    for line in file:
        if not (line := line.strip()):  # Skip blank lines.
            continue

        if state == 0:
            if line.startswith('Name:'):  # First line of student data?
                name = line.split()[1]
                json_lines = []
                state = 1

        elif state == 1:
            if line.startswith('('):  # First JSON line?
                json_lines.append(line)
                state = 2

        elif state == 2:
            json_lines.append(line)
            if line.endswith(')'):  # Last JSON line?
                json_str = literal_eval(''.join(json_lines))
                result_dict = json.loads(json_str)
                as_json = json.dumps(result_dict['result'])
                student_info[name] = f'({repr(as_json)})'
                state = 0

for student, json_str in student_info.items():
    print(f'{student}: {json_str}')

Printed results:

Sham: ('{"School name": "abc", "roll no": 2, "Class": "7B", "Marks": 310.55, "class_average_percentage": 81.523, "percentage": 91}')
Ram: ('{"School name": "abc", "roll no": 3, "Class": "8A", "Marks": 250.89, "class_average_percentage": 86.23, "percentage": 68}')
Roy: ('{"School name": "abc", "roll no": 9, "Class": "7B", "Marks": 298.45, "class_average_percentage": 81.523, "percentage": 73}')
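Because each stored value is a parenthesized Python string literal rather than bare JSON, reading a field back takes two steps. The following is a sketch under that assumption, with one entry copied from the printed results; the variable name sham is only illustrative.

```python
from ast import literal_eval
import json

# One entry in the same shape the code above stores in student_info.
student_info = {
    'Sham': """('{"School name": "abc", "roll no": 2, "Class": "7B", "Marks": 310.55, "class_average_percentage": 81.523, "percentage": 91}')""",
}

# literal_eval() strips the parentheses and quotes, json.loads() parses the JSON.
sham = json.loads(literal_eval(student_info['Sham']))
print(sham['Marks'])  # 310.55
```

If the wrapper is not needed, storing as_json directly (or the dictionary itself) would make the literal_eval() step unnecessary.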

CodePudding user response：

The input file has a critical fault: the first record is missing a double quote before roll no. Fixing that gives:

Name: Sham
good student
('{"result": {"School name": "abc", "roll no": 2, "Class": "7B", "Marks": '
 '310.55, "class_average_percentage": 81.523, "percentage": 91'}}')

Name: Ram
need to improve
('{"result": {"School name": "abc", "roll no": 3, "Class": "8A", "Marks": '
 '250.89, "class_average_percentage": 86.23, "percentage": 68'}}')

Name: Roy
average student
('{"result": {"School name": "abc", "roll no": 9, "Class": "7B", "Marks": '
 '298.45, "class_average_percentage": 81.523, "percentage": 73'}}')

The file appears to be organized in groups of five lines: the first holds the student name, the second is ignored, the significant data are split across the next two lines, and a blank line follows.

The data are still not valid JSON because of the extraneous single quotes, but it looks as though they can simply be removed. This leads us to:

import json

name = None
result_dictionary = dict()
js = None

with open('Result.txt') as results:
    for line in map(str.strip, results):
        if line.startswith('Name'):
            _, name = line.split()
        elif line.startswith('('):
            js = line[2:]
        else:
            if js:
                js = (js + line[:-1]).replace("'", "")
                result_dictionary[name] = json.loads(js)
                js = None

print(json.dumps(result_dictionary, indent=2))

Output:

{
  "Sham": {
    "result": {
      "School name": "abc",
      "roll no": 2,
      "Class": "7B",
      "Marks": 310.55,
      "class_average_percentage": 81.523,
      "percentage": 91
    }
  },
  "Ram": {
    "result": {
      "School name": "abc",
      "roll no": 3,
      "Class": "8A",
      "Marks": 250.89,
      "class_average_percentage": 86.23,
      "percentage": 68
    }
  },
  "Roy": {
    "result": {
      "School name": "abc",
      "roll no": 9,
      "Class": "7B",
      "Marks": 298.45,
      "class_average_percentage": 81.523,
      "percentage": 73
    }
  }
}

This allows simple access to each student's data by name.
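As a small illustration of that access, here is a sketch that uses a one-student stand-in for result_dictionary (data copied from the output above); the name "Nobody" is a made-up example of a student who is not in the file.

```python
# Stand-in for the result_dictionary built by the code above.
result_dictionary = {
    "Sham": {
        "result": {
            "School name": "abc",
            "roll no": 2,
            "Class": "7B",
            "Marks": 310.55,
            "class_average_percentage": 81.523,
            "percentage": 91,
        }
    },
}

sham = result_dictionary["Sham"]["result"]
print(sham["Class"], sham["Marks"])  # 7B 310.55

# .get() returns None instead of raising KeyError for an unknown name.
missing = result_dictionary.get("Nobody")
print(missing)  # None
```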