Automate fractal-like nested JSON normalization

The problem:

I have 100 JSON files with a fractal-like structure of lists of dicts. The width and the height of the data structure vary a lot from one JSON file to another. Each label is part of a sentence.

test = [
    {
        "label": "I",
        "children": [
            {
                "label": "want",
                "children": [
                    {
                        "label": "a",
                        "children": [
                            {"label": "coffee"},
                            {"label": "big", "children": [{"label": "piece of cake"}]},
                        ],
                    }
                ],
            },
            {"label": "need", "children": [{"label": "time"}]},
            {"label": "like",
                "children": [{"label": "italian", "children": [{"label": "pizza"}]}],
            },
        ],
    },
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]

I want to automate the normalization of the JSON to obtain this type of output:

sentences = [
'I want a coffee', 
'I want a big piece of cake', 
'I need time', 
'I like italian pizza', 
'We are ok',
'We will rock you',
]

It's really like the os.walk function that returns each "path".

What I tried:

pandas.json_normalize, but it needs predefined meta and record_path arguments to work with complex hierarchies;
jsonpath_ng with parse('[*]..label'), but I couldn't find a way to make it work;
a flatten function like the one in this post, which produces:

{'0label': 'I',
 '0children_0label': 'want',
 '0children_0children_0label': 'a',
 '0children_0children_0children_0label': 'coffee',
 '0children_0children_0children_1label': 'big',
 '0children_0children_0children_1children_0label': 'piece of cake',
 '0children_1label': 'need',
 '0children_1children_0label': 'time',
 '0children_2label': 'like',
 '0children_2children_0label': 'italian',
 '0children_2children_0children_0label': 'pizza',
 '1label': 'We',
 '1children_0label': 'are',
 '1children_0children_0label': 'ok',
 '1children_1label': 'will',
 '1children_1children_0label': 'rock you'}

I tried to split the keys to identify the hierarchy, but I have an indexing problem. For example, I don't understand why some keys like '1children_0label' contain '0label' and not the '1label' index that should refer to {'1label': 'We'}.

while loops that output a list of 'levels', each a list of tuples containing the count of n+1 children and the label. It was meant to be the first step toward recreating the final output, but I couldn't work this out either.

import copy

levels = []
stack = copy.deepcopy(test)
while stack:
    idx = []       # (number of children, label) for each node on this level
    children = []  # nodes that make up the next level
    for d in stack:
        if 'children' in d:
            n = len(d['children'])
        else:
            n = 0
        occurrences = (n, d['label'])
        idx.append(occurrences)

        if 'children' in d:
            children.extend(d['children'])

    stack = children
    levels.append(idx)

print(levels)

Output:

[[(3, 'I'), (2, 'We')], [(1, 'want'), (1, 'need'), (1, 'like'), (1, 'are'), (1, 'will')], [(2, 'a'), (0, 'time'), (1, 'italian'), (0, 'ok'), (0, 'rock you')], [(0, 'coffee'), (1, 'big'), (0, 'pizza')], [(0, 'piece of cake')]]

Please help.

CodePudding user response:

You can try recursion:

def get_sentences(o):
    if isinstance(o, dict):
        if "children" in o:
            for item in get_sentences(o["children"]):
                yield o["label"] + " " + item
        else:
            yield o["label"]
    elif isinstance(o, list):
        for v in o:
            yield from get_sentences(v)


print(list(get_sentences(test)))

Prints:

[
    "I want a coffee",
    "I want a big piece of cake",
    "I need time",
    "I like italian pizza",
    "We are ok",
    "We will rock you",
]
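
If some of the files are nested deeply enough to run into Python's recursion limit (1000 frames by default), the same walk can probably be done with an explicit stack instead. This is only a minimal sketch: it assumes every node has a "label" key and that a missing or empty "children" list marks the end of a sentence. The names iter_sentences and sample are made up for this example, and sample is just a cut-down piece of the data from the question.

```python
def iter_sentences(nodes):
    """Yield one sentence per root-to-leaf path, without recursion."""
    # Each stack entry pairs a node with the labels collected above it.
    # Pushing in reverse order keeps the output in document order.
    stack = [(node, ()) for node in reversed(nodes)]
    while stack:
        node, prefix = stack.pop()
        path = prefix + (node["label"],)
        # Assumption: a missing or empty "children" list means a leaf.
        children = node.get("children", [])
        if not children:
            yield " ".join(path)
        for child in reversed(children):
            stack.append((child, path))


# Cut-down sample taken from the question's data.
sample = [
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]

print(list(iter_sentences(sample)))
# ['We are ok', 'We will rock you']
```

One difference from the recursive version: a node whose "children" list is empty is treated here as the end of a sentence, whereas the recursive generator yields nothing for such a node.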