How do I convert my text file to nested JSON in Python?

I have a text file that I want to convert to a nested JSON structure. The text file is:

Report_for Reconciliation
Execution_of application_1673496470638_0001
Spark_version 2.4.7-amzn-0
Java_version 1.8.0_352 (Amazon.com Inc.)
Start_time 2023-01-12 09:45:13.360000
Spark Properties: 
Job_ID 0
Submission_time 2023-01-12 09:47:20.148000
Run_time 73957ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 0
Number_of_tasks 16907
Number_of_executed_tasks 16907
Completion_time 73207ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 1
Submission_time 2023-01-12 09:48:34.177000
Run_time 11525ms
Result JobSucceeded
Number_of_stages 2
Stage_ID 1
Number_of_tasks 16907
Number_of_executed_tasks 0
Completion_time 0ms
Stage_executed parquet at RawDataPublisher.scala:53
Stage_ID 2
Number_of_tasks 300
Number_of_executed_tasks 300
Completion_time 11520ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 2
Submission_time 2023-01-12 09:48:46.908000
Run_time 218358ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 3
Number_of_tasks 1135
Number_of_executed_tasks 1135
Completion_time 218299ms
Stage_executed parquet at RawDataPublisher.scala:53

I want the output to be:

{
    "Report_for": "Reconciliation",
    "Execution_of": "application_1673496470638_0001",
    "Spark_version": "2.4.7-amzn-0",
    "Java_version": "1.8.0_352 (Amazon.com Inc.)",
    "Start_time": "2023-01-12 09:45:13.360000",
    "Job_ID 0": {
        "Submission_time": "2023-01-12 09:47:20.148000",
        "Run_time": "73957ms",
        "Result": "JobSucceeded",
        "Number_of_stages": "1",
        "Stage_ID 0": {
            "Number_of_tasks": "16907",
            "Number_of_executed_tasks": "16907",
            "Completion_time": "73207ms",
            "Stage_executed": "parquet at RawDataPublisher.scala:53"
        }
    }
}

I tried the defaultdict approach, but it generated JSON in which every value is a list, which I can't use to build a table. Here's what I did:

import json
from collections import defaultdict

INPUT = 'demofile.txt'
dict1 = defaultdict(list)

def convert():
    with open(INPUT) as f:
        for line in f:
            command, description = line.strip().split(None, 1)
            dict1[command].append(description.strip())
    # write the result; the with-block makes sure the file is closed
    with open("demo1file.json", "w") as out:
        json.dump(dict1, out, indent=4, sort_keys=False)

and was getting this:

     "Report_for": [ "Reconciliation" ], 
     "Execution_of": [ "application_1673496470638_0001" ], 
     "Spark_version": [ "2.4.7-amzn-0" ], 
     "Java_version": [ "1.8.0_352 (Amazon.com Inc.)" ], 
     "Start_time": [ "2023-01-12 09:45:13.360000" ], 
      "Job_ID": [ 
           "0", 
           "1", 
           "2", ....
]]]

I just want to convert my text to the JSON format above so that I can build a table on top of it.

CodePudding user response:

Neither Python nor any of its libraries can infer your nesting requirements when the input is flat text. How would it know, for example, that stages belong inside jobs?

You have to tell your program explicitly how the nesting works.

I hacked together an example that should work; you can go from there (assuming input_str holds what you posted as your file content):

# define your nesting structure
nesting = {'Job_ID': {'Stage_ID': {}}}
upper_nestings = []
upper_nesting_keys = []

# your resulting dictionary
result_dict = {}

# your "working" dictionaries
current_nesting = nesting
working_dict = result_dict

# parse each line of the input string
for line_str in input_str.split('\n'):

    # the key is the first word; the value is all the words that follow
    line = line_str.split(' ')

    # if the key is in the nesting, create a new sub-dict; all following entries become part of it
    if line[0] in current_nesting.keys():
        current_nesting = current_nesting[line[0]]
        upper_nestings.append(line[0])
        upper_nesting_keys.append(line[1])
        working_dict[line_str] = {}
        working_dict = working_dict[line_str]

    else:
        # if a new "parallel" or "upper" nesting is detected, reset your nesting structure
        if line[0] in upper_nestings:
            nests = upper_nestings[:upper_nestings.index(line[0])]
            keys = upper_nesting_keys[:upper_nestings.index(line[0])]
        
            working_dict = result_dict
            for nest in nests:
                working_dict = working_dict[' '.join([nest, keys[nests.index(nest)]])]
    
            upper_nestings = upper_nestings[:upper_nestings.index(line[0]) + 1]
        
            upper_nesting_keys = upper_nesting_keys[:upper_nestings.index(line[0])]
            upper_nesting_keys.append(line[1])
        
            current_nesting = nesting
            for nest in upper_nestings:
                current_nesting = current_nesting[nest]
        
            working_dict[line_str] = {}
            working_dict = working_dict[line_str]
            continue
        
        working_dict[line[0]] = ' '.join(line[1:])
    
print(result_dict)

Results in:

{
  'Report_for': 'Reconciliation', 
  'Execution_of': 'application_1673496470638_0001',
  'Spark_version': '2.4.7-amzn-0', 
  'Java_version': '1.8.0_352 (Amazon.com Inc.)', 
  'Start_time': '2023-01-12 09:45:13.360000', 
  'Spark': 'Properties: ', 
  'Job_ID 0': {
    'Submission_time': '2023-01-12 09:47:20.148000', 
    'Run_time': '73957ms', 
    'Result': 'JobSucceeded', 
    'Number_of_stages': '1', 
    'Stage_ID 0': {
      'Number_of_tasks': '16907', 
      'Number_of_executed_tasks': '16907', 
      'Completion_time': '73207ms', 
      'Stage_executed': 'parquet at RawDataPublisher.scala:53'
    }
  }, 
  'Job_ID 1': {
    'Submission_time': '2023-01-12 09:48:34.177000', 
    'Run_time': '11525ms', 
    'Result': 'JobSucceeded', 
    'Number_of_stages': '2', 
    'Stage_ID 1': {
      'Number_of_tasks': '16907', 
      'Number_of_executed_tasks': '0', 
      'Completion_time': '0ms', 
      'Stage_executed': 'parquet at RawDataPublisher.scala:53'
    }, 
    'Stage_ID 2': {
      'Number_of_tasks': '300', 
      'Number_of_executed_tasks': '300', 
      'Completion_time': '11520ms', 
      'Stage_executed': 'parquet at RawDataPublisher.scala:53'
    }
  }, 
  'Job_ID 2': {
    'Submission_time': '2023-01-12 09:48:46.908000', 
    'Run_time': '218358ms', 
    'Result': 'JobSucceeded', 
    'Number_of_stages': '1', 
    'Stage_ID 3': {
      'Number_of_tasks': '1135', 
      'Number_of_executed_tasks': '1135', 
      'Completion_time': '218299ms', 
      'Stage_executed': 'parquet at RawDataPublisher.scala:53'
    }
  }
}

This should be generic enough to handle most nesting definitions over a flat input. Let me know if it works for you!
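If the bookkeeping above feels heavy, the same idea can be sketched more compactly with a stack of open dictionaries. This is only an alternative sketch, not the code from the answer above: the names `LEVELS` and `parse_report` are made up for illustration, and it assumes each nesting key (`Job_ID`, `Stage_ID`) always sits at a fixed depth. It also finishes with `json.dumps`, since the question asks for JSON rather than a Python dict.

```python
import json

# Assumption: keys that open a nested block, mapped to their nesting depth.
LEVELS = {"Job_ID": 1, "Stage_ID": 2}


def parse_report(text, levels=LEVELS):
    """Turn flat 'key value' lines into nested dicts, guided by `levels`."""
    root = {}
    stack = [root]  # stack[-1] is the dict that currently receives entries
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        if key in levels:
            # close every open block at this depth or deeper
            del stack[levels[key]:]
            child = {}
            stack[-1][line] = child  # the whole line is the key, e.g. "Job_ID 0"
            stack.append(child)
        else:
            stack[-1][key] = value.strip()
    return root


# Shortened sample in the same shape as the file from the question.
sample = """\
Report_for Reconciliation
Job_ID 0
Result JobSucceeded
Stage_ID 0
Number_of_tasks 16907
Job_ID 1
Result JobSucceeded
Stage_ID 1
Number_of_tasks 16907
Stage_ID 2
Number_of_tasks 300
"""

result = parse_report(sample)
print(json.dumps(result, indent=4))
```

To write the file instead of printing, `json.dump(result, f, indent=4)` inside a `with open(...)` block should do. Adding another nesting level would only mean adding one more entry to `LEVELS`.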