Using Python to recursively map a JSON file for all keys, while also recording the depth of "ne

Background:

I have a large JSON file (tens of GB) where each "line" of the file is a log entry (also in JSON format). The logs come from several different sources, so the fields in one log may or may not show up in any other log. In other words, the "keys" of the JSON format, and the degree to which the keys are nested within each other, are not consistent from log to log.

My Goal:

I need to be able to find all of the keys in any given log [already accomplished], and keep track of which keys are nested under which other keys [this is the part I haven't solved yet].

My work so far:

I can't load the file into memory all at once due to its size, so I've been examining it line by line (log by log). I've redacted the actual content, but let's say the first line (log) of the file looks like this:

{'foo': {'bar': 'data',
         'biz': 'data'},
 'baz': 'data',
 'qux': {'quux': 'data',
         'quuz': 'data'},
 'corge': {'grault': {'garply': 'data'},
           'waldo': {'fred': 'data',
                     'plugh': 'data'},
           'xyzzy': 'data'},
 'thud': 'data'}

I've been using this code...

import json

KEYS = []  # global accumulator, filled in the order keys appear

def find_nested_dicts(d):
    for k, v in d.items():
        KEYS.append(k)
        if isinstance(v, dict):
            find_nested_dicts(v)
    return KEYS

with open(full_geodata_path, "r") as f:
    next_line = f.readline()
    next_line_json = json.loads(next_line)
    output = find_nested_dicts(next_line_json)

...to produce this:

output = [
    'foo', 
    'bar', 
    'biz', 
    'baz', 
    'qux', 
    'quux', 
    'quuz', 
    'corge', 
    'grault', 
    'garply', 
    'waldo',
    'fred', 
    'plugh',
    'xyzzy',
    'thud'
]

I don't love using KEYS as a global variable, but I wasn't able to get it working locally.

Note that the order in which each key is appended to this list is important. The names of the keys rarely have anything that indicates which parent key they are nested under. In other words, I've intentionally written the code so the keys are appended to the output list in the same order they are presented in the JSON format.

The result shown above is halfway to my goal. However, I want to be able to produce a list containing each key, AND the nested depth at which it was retrieved. I'm trying to build a list that looks like this:

output = [ 
    ('foo', 0), 
    ('bar', 1), 
    ('biz', 1), 
    ('baz', 0), 
    ('qux', 0), 
    ('quux', 1), 
    ('quuz', 1), 
    ('corge', 0), 
    ('grault', 1), 
    ('garply', 2), 
    ('waldo', 1),
    ('fred', 2), 
    ('plugh', 2),
    ('xyzzy', 1),
    ('thud', 0)
]

This function is probably the closest I've gotten:

def find_nested_dicts(d, depth=0):
    for k, v in d.items():
        KEYS.append((k, depth))
        depth += 1
        if isinstance(v, dict):
            find_nested_dicts(v, depth)
        else:
            depth -= 1
    return KEYS

The problem with this attempt is that the function only ever causes the "depth" counter to increase.

I've tried mapping the recursion out on paper, and tried moving the depth += 1, depth -= 1, and KEYS.append((k, depth)) statements in and out of the for loop and if block in various combinations, hoping for a better result.

At this point, I'm not sure if it's my own inability to fully conceptualize recursion, or whether what I am trying to do is actually possible with plain Python. Any insight is much appreciated.

One final note: I do not have admin rights to the machine I am attempting to solve this problem on, and accessing this data from my personal machine is not an option. From previous projects I know I have the following libraries available for use:

import numpy as np
import os
import pandas as pd
import json
import itertools
from collections import Counter
import matplotlib.pyplot as plt

CodePudding user response:

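def foo(d, arr, n):
    # Record each key with its depth n, then recurse into its value one level deeper.
    # n + 1 is passed down rather than stored, so siblings keep the same depth.
    if isinstance(d, dict):
        for k in d.keys():
            arr.append((k, n))
            foo(d[k], arr, n + 1)

ans = []
foo(data, ans, 0)  # data is the parsed log (a dict) from the question
ans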
# [('foo', 0),
#  ('bar', 1),
#  ('biz', 1),
#  ('baz', 0),
#  ('qux', 0),
#  ('quux', 1),
#  ('quuz', 1),
#  ('corge', 0),
#  ('grault', 1),
#  ('garply', 2),
#  ('waldo', 1),
#  ('fred', 2),
#  ('plugh', 2),
#  ('xyzzy', 1),
#  ('thud', 0)]
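
If you would rather not pass an accumulator list around, the same idea can probably be written as a generator. This is only a sketch under a few assumptions: the names iter_key_depths and iter_log_key_depths are made up for illustration, each non-blank line is assumed to be one complete JSON object, and, like the function above, it only descends into dicts (dicts nested inside lists are ignored). The sample lines are shortened stand-ins for the log in the question.

```python
import json


def iter_key_depths(obj, depth=0):
    """Yield (key, depth) pairs in the order the keys appear."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield key, depth
            # Pass depth + 1 down instead of mutating depth, so the
            # remaining siblings at this level keep the original value.
            yield from iter_key_depths(value, depth + 1)


def iter_log_key_depths(lines):
    """Parse one JSON log per line; yield that log's list of (key, depth)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield list(iter_key_depths(json.loads(line)))


# Hypothetical sample data standing in for two lines of the log file.
sample_lines = [
    '{"foo": {"bar": "data", "biz": "data"}, "baz": "data"}',
    '{"corge": {"grault": {"garply": "data"}, "xyzzy": "data"}, "thud": "data"}',
]
results = list(iter_log_key_depths(sample_lines))
```

Since a file object opened with open() yields its lines lazily, passing it as the lines argument should process the file one log at a time without loading it all into memory.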