How to keep flatten JSON data regardless of level of nesting it is without specifying the levels in-CodePudding

I have a JSON tree that looks like this:

└─  (array)
   └─  (object)
      ├─ "site" (string)
      ├─ "title" (string)
      ├─ "child" (array)
      │  └─  (object)
      │     ├─ "type" (string)
      │     ├─ "value" (string)
      │     ├─ "child" (array)
      │     │  └─  (object)
      │     │     ├─ "type" (string)
      │     │     ├─ "value" (string)
      │     │     ├─ "child" (array)
      │     │     │  └─  (object)
      │     │     │     ├─ "type" (string)
      │     │     │     ├─ "key" (string)
      │     │     │     ├─ "id" (string)
      │     │     │     └─ "value" (string)
      │     │     ├─ "id" (string)
      │     │     └─ "key" (string)
      │     ├─ "id" (string)
      │     └─ "key" (string)
      ├─ "featured_image_url" (string)
      ├─ "id" (string)
      └─ "key" (string)

I think this represents the above data:

data = [{"site": "example", "title": "Example Site", "child": [{"type": "text", "value": "This is some text.", "child": [{"type": "image", "value": "image.jpg", "child": [{"type": "link", "key": "link1", "id": "link1", "value": "Link 1"}], "id": "text1", "key": "text1"}, {"type": "image", "value": "image2.jpg", "child": [{"type": "link", "key": "link2", "id": "link2", "value": "Link 2"}], "id": "text2", "key": "text2"}], "id": "root", "key": "root"}, {"featured_image_url": "featured_image.jpg", "id": "root2", "key": "root2"}]}]

My goal is too extract all data in every child array regardless of the level of nesting. child is list of dictionary objects.

I have tried using pd.json_normalize with record path set to child per level, but at some point the dataframe does not have records for every record in the child column. Some values are nan some are empty lists.

df = pd.json_normalize(data, record_path=['child', 'child', 'child'],
                       meta=['site', 'title', 'featured_image_url',
                             'id', 'key','child'],
                       meta_prefix='_')

The df.child from the above code looks like this:

                       child
0                                                    NaN
1      [{'type': 'HTML', 'key': '77100c3f-760d-452a-a...
2      [{'type': 'HTML', 'key': '6fce8911-7e46-42e5-b...
3      []

Is there anyway I can extract all data from every child array regardless of nesting for data for type, key, id, value into a dataframe.

CodePudding user response：

from pprint import PrettyPrinter

data = [{"site": "example", "title": "Example Site", "child": [{"type": "text", "value": "This is some text.", "child": [{"type": "image", "value": "image.jpg", "child": [{"type": "link", "key": "link1", "id": "link1", "value": "Link 1"}], "id": "text1", "key": "text1"}, {"type": "image", "value": "image2.jpg", "child": [{"type": "link", "key": "link2", "id": "link2", "value": "Link 2"}], "id": "text2", "key": "text2"}], "id": "root", "key": "root"}, {"featured_image_url": "featured_image.jpg", "id": "root2", "key": "root2"}]}]

items = data
result = []
for i in items:
    if isinstance(i, list):
        items.extend(i)
    elif isinstance(i, dict):
        entry = {k:v for k, v in i.items() if isinstance(v, str)}
        result.append(entry)
        child = i.pop("child", None)
        if child is not None:
            items.extend(child)

pp = PrettyPrinter(indent=4)
pp.pprint(result)

Output

[   {'site': 'example', 'title': 'Example Site'},
    {   'id': 'root',
        'key': 'root',
        'type': 'text',
        'value': 'This is some text.'},
    {'featured_image_url': 'featured_image.jpg', 'id': 'root2', 'key': 'root2'},
    {'id': 'text1', 'key': 'text1', 'type': 'image', 'value': 'image.jpg'},
    {'id': 'text2', 'key': 'text2', 'type': 'image', 'value': 'image2.jpg'},
    {'id': 'link1', 'key': 'link1', 'type': 'link', 'value': 'Link 1'},
    {'id': 'link2', 'key': 'link2', 'type': 'link', 'value': 'Link 2'}]

You can now prepare/modify result, so that it fits into a pandas DataFrame.