I have a JSON tree that looks like this:
└─ (array)
└─ (object)
├─ "site" (string)
├─ "title" (string)
├─ "child" (array)
│ └─ (object)
│ ├─ "type" (string)
│ ├─ "value" (string)
│ ├─ "child" (array)
│ │ └─ (object)
│ │ ├─ "type" (string)
│ │ ├─ "value" (string)
│ │ ├─ "child" (array)
│ │ │ └─ (object)
│ │ │ ├─ "type" (string)
│ │ │ ├─ "key" (string)
│ │ │ ├─ "id" (string)
│ │ │ └─ "value" (string)
│ │ ├─ "id" (string)
│ │ └─ "key" (string)
│ ├─ "id" (string)
│ └─ "key" (string)
├─ "featured_image_url" (string)
├─ "id" (string)
└─ "key" (string)
I think this represents the above data:
data = [{"site": "example", "title": "Example Site", "child": [{"type": "text", "value": "This is some text.", "child": [{"type": "image", "value": "image.jpg", "child": [{"type": "link", "key": "link1", "id": "link1", "value": "Link 1"}], "id": "text1", "key": "text1"}, {"type": "image", "value": "image2.jpg", "child": [{"type": "link", "key": "link2", "id": "link2", "value": "Link 2"}], "id": "text2", "key": "text2"}], "id": "root", "key": "root"}, {"featured_image_url": "featured_image.jpg", "id": "root2", "key": "root2"}]}]
My goal is too extract all data in every child
array regardless of the level of nesting. child
is list of dictionary objects.
I have tried using pd.json_normalize
with record path set to child
per level, but at some point the dataframe does not have records for every record in the child
column. Some values are nan
some are empty lists.
df = pd.json_normalize(data, record_path=['child', 'child', 'child'],
meta=['site', 'title', 'featured_image_url',
'id', 'key','child'],
meta_prefix='_')
The df.child
from the above code looks like this:
child
0 NaN
1 [{'type': 'HTML', 'key': '77100c3f-760d-452a-a...
2 [{'type': 'HTML', 'key': '6fce8911-7e46-42e5-b...
3 []
Is there anyway I can extract all data from every child
array regardless of nesting for data for type
, key
, id
, value
into a dataframe.
CodePudding user response:
from pprint import PrettyPrinter
data = [{"site": "example", "title": "Example Site", "child": [{"type": "text", "value": "This is some text.", "child": [{"type": "image", "value": "image.jpg", "child": [{"type": "link", "key": "link1", "id": "link1", "value": "Link 1"}], "id": "text1", "key": "text1"}, {"type": "image", "value": "image2.jpg", "child": [{"type": "link", "key": "link2", "id": "link2", "value": "Link 2"}], "id": "text2", "key": "text2"}], "id": "root", "key": "root"}, {"featured_image_url": "featured_image.jpg", "id": "root2", "key": "root2"}]}]
items = data
result = []
for i in items:
if isinstance(i, list):
items.extend(i)
elif isinstance(i, dict):
entry = {k:v for k, v in i.items() if isinstance(v, str)}
result.append(entry)
child = i.pop("child", None)
if child is not None:
items.extend(child)
pp = PrettyPrinter(indent=4)
pp.pprint(result)
Output
[ {'site': 'example', 'title': 'Example Site'},
{ 'id': 'root',
'key': 'root',
'type': 'text',
'value': 'This is some text.'},
{'featured_image_url': 'featured_image.jpg', 'id': 'root2', 'key': 'root2'},
{'id': 'text1', 'key': 'text1', 'type': 'image', 'value': 'image.jpg'},
{'id': 'text2', 'key': 'text2', 'type': 'image', 'value': 'image2.jpg'},
{'id': 'link1', 'key': 'link1', 'type': 'link', 'value': 'Link 1'},
{'id': 'link2', 'key': 'link2', 'type': 'link', 'value': 'Link 2'}]
You can now prepare/modify result
, so that it fits into a pandas DataFrame.