I have thousands of very large JSON files from which I need to process specific elements. To avoid memory overload I am using a Python library called ijson, which works fine when I process only a single element from a JSON file, but when I try to process multiple elements at once it throws:
IncompleteJSONError: parse error: premature EOF
Partial JSON:
{
    "info": {
        "added": 1631536344.112968,
        "started": 1631537322.81162,
        "duration": 14,
        "ended": 1631537337.342377
    },
    "network": {
        "domains": [
            {
                "ip": "231.90.255.25",
                "domain": "dns.msfcsi.com"
            },
            {
                "ip": "12.23.25.44",
                "domain": "teo.microsoft.com"
            },
            {
                "ip": "87.101.90.42",
                "domain": "www.msf.com"
            }
        ]
    }
}
Working code (the file is opened multiple times):
import glob
import ijson

my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
            row['added'] = float(o.get('added'))
            row['started'] = float(o.get('started'))
            row['duration'] = o.get('duration')
            row['ended'] = float(o.get('ended'))
    with open(filename, 'r') as f:
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count += 1
        row['domain_count'] = domain_count
    final_result.append(row)
Failing code (the file is opened once):
my_file_list = [f for f in glob.glob("data/jsons/*.json")]
final_result = []
for filename in my_file_list:
    row = {}
    with open(filename, 'r') as f:
        info = ijson.items(f, 'info')
        for o in info:
            row['added'] = float(o.get('added'))
            row['started'] = float(o.get('started'))
            row['duration'] = o.get('duration')
            row['ended'] = float(o.get('ended'))
        domains = ijson.items(f, 'network.domains.item')
        domain_count = 0
        for domain in domains:
            domain_count += 1
        row['domain_count'] = domain_count
    final_result.append(row)
I am not sure whether the cause is the one described in Using python ijson to read a large json file with multiple json objects, i.e. that ijson cannot work on multiple JSON elements at once.
Also, please let me know of any other Python package, or any sample example, that can handle large JSON files without memory issues.
CodePudding user response:
I think this is happening because you've finished reading your IO stream from the file: you're already at the end, and you're asking for another query.
What you can do is reset the cursor to position 0 before the second query:
f.seek(0)
In a comment I said that you should try json-stream as well, but this is not an ijson or json-stream bug; it's a TextIO feature. Calling f.seek(0) is the equivalent of opening the file a second time.
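Applied to the failing single-open loop from the question, the fix is one extra line between the two queries (a minimal sketch; filename and row come from the question's loop):

import ijson

row = {}
with open(filename, 'r') as f:
    # First query: pull the 'info' object.
    for o in ijson.items(f, 'info'):
        row['added'] = float(o.get('added'))
        row['started'] = float(o.get('started'))
        row['duration'] = o.get('duration')
        row['ended'] = float(o.get('ended'))

    # The first iteration consumed the stream up to EOF, so rewind it
    # before the second query; otherwise ijson raises IncompleteJSONError.
    f.seek(0)

    domain_count = 0
    for domain in ijson.items(f, 'network.domains.item'):
        domain_count += 1
    row['domain_count'] = domain_count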
If you don't want to do this, then maybe you should look at iterating through every portion of the JSON, and then deciding for each object whether it has info or network.domains.item.
CodePudding user response:
While the answer above is correct, you can do better: if you know the structure of your JSON file and can rely on it, you can use this to your advantage and read the file only once.
ijson has an event interception mechanism, and the example there is very similar to what you want to achieve. In your case you want to get the info values, then iterate over the network.domains.item entries and count them. This should do:
row = {}
with open(filename, 'r') as f:
    parse_events = ijson.parse(f, use_float=True)
    for prefix, event, value in parse_events:
        if prefix == 'info.added':
            row['added'] = value
        elif prefix == 'info.started':
            row['started'] = value
        elif prefix == 'info.duration':
            row['duration'] = value
        elif prefix == 'info.ended':
            row['ended'] = value
        elif prefix == 'info' and event == 'end_map':
            break
    row['domain_count'] = sum(1 for _ in ijson.items(parse_events, 'network.domains.item'))
Note how:

- ijson.items is fed with the result of ijson.parse.
- use_float=True saves you from having to convert the values to float yourself.
- The counting can be done by sum()-ing 1 for each item coming from ijson.items, so you don't have to loop yourself manually.
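For completeness, here is a sketch of how this single-pass version could slot into the original loop over the glob results (final_result and the "data/jsons/*.json" pattern are taken from the question; the prefix tuple check is just a compact variant of the elif chain above):

import glob
import ijson

final_result = []
for filename in glob.glob("data/jsons/*.json"):
    row = {}
    with open(filename, 'r') as f:
        parse_events = ijson.parse(f, use_float=True)
        for prefix, event, value in parse_events:
            # Scalar members of "info" arrive with prefixes like 'info.added';
            # store them under their bare key names.
            if prefix in ('info.added', 'info.started', 'info.duration', 'info.ended'):
                row[prefix.split('.', 1)[1]] = value
            elif prefix == 'info' and event == 'end_map':
                break  # done with "info"; the remaining events go to ijson.items below
        row['domain_count'] = sum(1 for _ in ijson.items(parse_events, 'network.domains.item'))
    final_result.append(row)

Each file is opened and scanned exactly once, so there is no need to seek back to the start.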