Normalizing nested JSON in pandas-CodePudding

I am trying to address a list of dictionaries inside a list of JSON dictionaries to produce a normalized table of its contents.

One item from my JSON file:

import pandas as pd
file_json =
[
    {
        "id": 1,
        "type": "gin",
        "name": "1085 Toledo Gin",
        "first_name": null,
        "second_name": null,
        "producer": "Licores Caro",
        "country": "Spain",
        "abv": "40",
        "direct_purchase_url": null,
        "average_rating": "6.6",
        "rating_count": 74,
        "user_rating": null,
        "in_wishlist": null,
        "in_cabinet": null,
        "perfect_tonics": {
            "data": [
                {
                    "id": 2,
                    "type": "tonic",
                    "name": "Macario Tonica",
                    "average_rating": "7.5",
                    "rating_count": 28
                }
            ]
        },
        "perfect_garnishes": {
            "data": [
                {
                    "id": 5,
                    "type": "garnish",
                    "name": "Grapefruit"
                },
                {
                    "id": 4,
                    "type": "garnish",
                    "name": "Lime zest"
                },
                {
                    "id": 3,
                    "type": "garnish",
                    "name": "Orange zest"
                }
            ]
        }
    }
]

What I tried so far:

df = pd.json_normalize(file_json)

Produces a table, but with only the top level flattened

df_nested_test = pd.json_normalize(file_json, record_path=['perfect_tonics'])

df_nested_test = pd.json_normalize(file_json, record_path=['perfect_tonics.data'])

df_nested_test = pd.json_normalize(file_json, record_path=['data'])

These all throw errors.

df['perfect_tonics.data'].apply(pd.Series)

This is promising, but I dont know where to go from here.

I would appreciate some advice. Thanks!

CodePudding user response：

If data contains the parsed Json from the question, you can try:

df = pd.DataFrame(data)
df["perfect_tonics"] = df["perfect_tonics"].str["data"]
df["perfect_garnishes"] = df["perfect_garnishes"].str["data"]

df = df.explode("perfect_tonics").explode("perfect_garnishes")

df = pd.concat(
    [
        df,
        df.pop("perfect_tonics").apply(pd.Series).add_prefix("perfect_tonics_"),
    ],
    axis=1,
)

df = pd.concat(
    [
        df,
        df.pop("perfect_garnishes")
        .apply(pd.Series)
        .add_prefix("perfect_garnishes_"),
    ],
    axis=1,
)

print(df.to_markdown(index=False))

Prints:

id	type	name	producer	country	abv	average_rating	rating_count	perfect_tonics_id	perfect_tonics_type	perfect_tonics_name	perfect_tonics_average_rating	perfect_tonics_rating_count	perfect_garnishes_id	perfect_garnishes_type	perfect_garnishes_name
1	gin	1085 Toledo Gin	Licores Caro	Spain	40	6.6	74	2	tonic	Macario Tonica	7.5	28	5	garnish	Grapefruit
1	gin	1085 Toledo Gin	Licores Caro	Spain	40	6.6	74	2	tonic	Macario Tonica	7.5	28	4	garnish	Lime zest
1	gin	1085 Toledo Gin	Licores Caro	Spain	40	6.6	74	2	tonic	Macario Tonica	7.5	28	3	garnish	Orange zest