I have thousands of very large JSON files from which I need to process specific elements. To avoid memory overload I am using a Python library called ijson, which works fine when I process only a single element from a JSON file, but when I try to process multiple elements at once it throws:
IncompleteJSONError: parse error: premature EOF
Partial JSON:
{
    "info": {
        "added": 1631536344.112968,
        "started": 1631537322.81162,
        "duration": 14,
        "ended": 1631537337.342377
    },
    "network": {
        "domains": [
            {
                "ip": "231.90.255.25",
                "domain": "dns.msfcsi.com"
            },
            {
                "ip": "12.23.25.44",
                "domain": "teo.microsoft.com"
            },
            {
                "ip": "87.101.90.42",
                "domain": "www.msf.com"
            }
        ]
    }
}
Working code (the file is opened multiple times):
import glob
import ijson

my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
            row['added'] = float(o.get('added'))
            row['started'] = float(o.get('started'))
            row['duration'] = o.get('duration')
            row['ended'] = float(o.get('ended'))
    with open(filename, 'r') as f:
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count += 1
        row['domain_count'] = domain_count
    final_result.append(row)
Failing code (the file is opened once):
my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
            row['added'] = float(o.get('added'))
            row['started'] = float(o.get('started'))
            row['duration'] = o.get('duration')
            row['ended'] = float(o.get('ended'))
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count += 1
        row['domain_count'] = domain_count
    final_result.append(row)
I am not sure whether the cause is the one described in Using python ijson to read a large json file with multiple json objects, i.e. that ijson cannot work on multiple JSON elements at once.
Also, please let me know of any other Python package, or any sample example, that can handle large JSON files without memory issues.
CodePudding user response:
I think this is happening because you've finished reading your IO stream from the file: you're already at the end, and you're asking for another query.
What you can do is reset the cursor to position 0 before the second query:
f.seek(0)
In a comment I said that you should try json-stream as well, but this is not an ijson or json-stream bug; it's a TextIO feature. Calling f.seek(0) is the equivalent of opening the file a second time.
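Applied to the failing single-open loop from the question, the fix is one extra line between the two queries (a minimal sketch; filename and row come from the question's loop):

import ijson

row = {}
with open(filename, 'r') as f:
    # First query: pull the 'info' object.
    for o in ijson.items(f, 'info'):
        row['added'] = float(o.get('added'))
        row['started'] = float(o.get('started'))
        row['duration'] = o.get('duration')
        row['ended'] = float(o.get('ended'))

    # The first iteration consumed the stream up to EOF, so rewind it
    # before the second query; otherwise ijson raises IncompleteJSONError.
    f.seek(0)

    domain_count = 0
    for domain in ijson.items(f, 'network.domains.item'):
        domain_count += 1
    row['domain_count'] = domain_count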
If you don't want to do this, then maybe you should look at iterating through every portion of the JSON, and then deciding for each object whether it has info or network.domains.item.
CodePudding user response:
While the answer above is correct, you can do better: if you know the structure of your JSON file and can rely on it, you can use this to your advantage and read the file only once.
ijson has an event interception mechanism, and the example there is very similar to what you want to achieve. In your case you want to get the info values, then iterate over the network.domains.item entries and count them. This should do:
row = {}
with open(filename, 'r') as f:
    parse_events = ijson.parse(f, use_float=True)
    for prefix, event, value in parse_events:
        if prefix == 'info.added':
            row['added'] = value
        elif prefix == 'info.started':
            row['started'] = value
        elif prefix == 'info.duration':
            row['duration'] = value
        elif prefix == 'info.ended':
            row['ended'] = value
        elif prefix == 'info' and event == 'end_map':
            break
    row['domain_count'] = sum(1 for _ in ijson.items(parse_events, 'network.domains.item'))
Note how:

- ijson.items is fed with the result of ijson.parse.
- use_float=True saves you from having to convert the values to float yourself.
- The counting can be done by sum()-ing 1 for each item coming from ijson.items, so you don't have to loop yourself manually.
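For completeness, here is a sketch of how this single-pass version could slot into the original loop over the glob results (final_result and the "data/jsons/*.json" pattern are taken from the question; the prefix tuple check is just a compact variant of the elif chain above):

import glob
import ijson

final_result = []
for filename in glob.glob("data/jsons/*.json"):
    row = {}
    with open(filename, 'r') as f:
        parse_events = ijson.parse(f, use_float=True)
        for prefix, event, value in parse_events:
            # Scalar members of "info" arrive with prefixes like 'info.added';
            # store them under their bare key names.
            if prefix in ('info.added', 'info.started', 'info.duration', 'info.ended'):
                row[prefix.split('.', 1)[1]] = value
            elif prefix == 'info' and event == 'end_map':
                break  # done with "info"; the remaining events go to ijson.items below
        row['domain_count'] = sum(1 for _ in ijson.items(parse_events, 'network.domains.item'))
    final_result.append(row)

Each file is opened and scanned exactly once, so there is no need to seek back to the start.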