The problem:
I have 100 JSON documents with a fractal-like structure of lists of dicts. The width and the height of the data structure vary a lot from one JSON to another. Each label is a part of a sentence.
test = [
    {
        "label": "I",
        "children": [
            {
                "label": "want",
                "children": [
                    {
                        "label": "a",
                        "children": [
                            {"label": "coffee"},
                            {"label": "big", "children": [{"label": "piece of cake"}]},
                        ],
                    }
                ],
            },
            {"label": "need", "children": [{"label": "time"}]},
            {
                "label": "like",
                "children": [{"label": "italian", "children": [{"label": "pizza"}]}],
            },
        ],
    },
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]
I want to automate the normalization of the JSON to obtain this kind of output:
sentences = [
'I want a coffee',
'I want a big piece of cake',
'I need time',
'I like italian pizza',
'We are ok',
'We will rock you',
]
It's really like the os.walk function, which returns each "path".
What I tried:
- pandas.json_normalize, but it needs predefined meta and record_path arguments to work with complex hierarchies;
- jsonpath_ng with parse('[*]..label'), but I couldn't find a way to make this work;
- a flatten function like the one in this post, which gives:
{'0label': 'I',
'0children_0label': 'want',
'0children_0children_0label': 'a',
'0children_0children_0children_0label': 'coffee',
'0children_0children_0children_1label': 'big',
'0children_0children_0children_1children_0label': 'piece of cake',
'0children_1label': 'need',
'0children_1children_0label': 'time',
'0children_2label': 'like',
'0children_2children_0label': 'italian',
'0children_2children_0children_0label': 'pizza',
'1label': 'We',
'1children_0label': 'are',
'1children_0children_0label': 'ok',
'1children_1label': 'will',
'1children_1children_0label': 'rock you'}
I tried to split the keys to identify the hierarchy, but I have an indexing problem. For example, I don't understand why some keys like '1children_0label' contain '0label' and not the '1label' index that should refer to {'1label': 'We'}.
- while loops that output a list of 'levels', each a list of tuples containing the number of children (at level n+1) and the label. It was meant to be the first step to recreate the final output, but I couldn't work this out either.
import copy

levels = []
stack = copy.deepcopy(test)
while stack:
    idx = []       # (number of children, label) for each node on this level
    children = []  # nodes that make up the next level
    for d in stack:
        if 'children' in d:
            n = len(d['children'])
        else:
            n = 0
        occurrences = (n, d['label'])
        idx.append(occurrences)
        if 'children' in d:
            children.extend(d['children'])
    stack = children  # move down one level
    levels.append(idx)
print(levels)
Output:
[[(3, 'I'), (2, 'We')], [(1, 'want'), (1, 'need'), (1, 'like'), (1, 'are'), (1, 'will')], [(2, 'a'), (0, 'time'), (1, 'italian'), (0, 'ok'), (0, 'rock you')], [(0, 'coffee'), (1, 'big'), (0, 'pizza')], [(0, 'piece of cake')]]
Please help.
Answer:
You can try a recursion:
def get_sentences(o):
    if isinstance(o, dict):
        if "children" in o:
            # Prefix every sentence built from the children with this label.
            for item in get_sentences(o["children"]):
                yield o["label"] + " " + item
        else:
            # A leaf ends the sentence.
            yield o["label"]
    elif isinstance(o, list):
        for v in o:
            yield from get_sentences(v)


print(list(get_sentences(test)))
Prints:
[
"I want a coffee",
"I want a big piece of cake",
"I need time",
"I like italian pizza",
"We are ok",
"We will rock you",
]
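If some of the trees are deep enough to run into Python's recursion limit, an explicit stack is one possible alternative to the recursion. This is only a minimal sketch, assuming every node is a dict with a "label" key and an optional "children" list; the names iter_sentences and sample are made up for this example and are not part of the original answer.

```python
def iter_sentences(nodes):
    """Yield one sentence per root-to-leaf path, without recursion."""
    # Each stack entry pairs a node with the labels collected above it.
    # Pushing in reverse order means nodes are popped left to right.
    stack = [(node, []) for node in reversed(nodes)]
    while stack:
        node, prefix = stack.pop()
        path = prefix + [node["label"]]
        children = node.get("children", [])
        if not children:
            # A node without children closes a path: emit the joined labels.
            yield " ".join(path)
        for child in reversed(children):
            stack.append((child, path))


# Small sample in the same shape as the data in the question.
sample = [
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]
print(list(iter_sentences(sample)))
```

One difference to be aware of: this sketch treats a node with an empty "children" list as a leaf, whereas the recursive version yields nothing for such a node.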