Home > Software design >  Normalizing nested JSON in pandas
Normalizing nested JSON in pandas

Time:11-03

I am trying to address a list of dictionaries inside a list of JSON dictionaries to produce a normalized table of its contents.

One item from my JSON file:

import pandas as pd
file_json =
[
    {
        "id": 1,
        "type": "gin",
        "name": "1085 Toledo Gin",
        "first_name": null,
        "second_name": null,
        "producer": "Licores Caro",
        "country": "Spain",
        "abv": "40",
        "direct_purchase_url": null,
        "average_rating": "6.6",
        "rating_count": 74,
        "user_rating": null,
        "in_wishlist": null,
        "in_cabinet": null,
        "perfect_tonics": {
            "data": [
                {
                    "id": 2,
                    "type": "tonic",
                    "name": "Macario Tonica",
                    "average_rating": "7.5",
                    "rating_count": 28
                }
            ]
        },
        "perfect_garnishes": {
            "data": [
                {
                    "id": 5,
                    "type": "garnish",
                    "name": "Grapefruit"
                },
                {
                    "id": 4,
                    "type": "garnish",
                    "name": "Lime zest"
                },
                {
                    "id": 3,
                    "type": "garnish",
                    "name": "Orange zest"
                }
            ]
        }
    }
]

What I tried so far:

df = pd.json_normalize(file_json)

Produces a table, but with only the top level flattened

df_nested_test = pd.json_normalize(file_json, record_path=['perfect_tonics'])

df_nested_test = pd.json_normalize(file_json, record_path=['perfect_tonics.data'])

df_nested_test = pd.json_normalize(file_json, record_path=['data'])

These all throw errors.

df['perfect_tonics.data'].apply(pd.Series)

This is promising, but I dont know where to go from here.

I would appreciate some advice. Thanks!

CodePudding user response:

If data contains the parsed Json from the question, you can try:

df = pd.DataFrame(data)
df["perfect_tonics"] = df["perfect_tonics"].str["data"]
df["perfect_garnishes"] = df["perfect_garnishes"].str["data"]

df = df.explode("perfect_tonics").explode("perfect_garnishes")

df = pd.concat(
    [
        df,
        df.pop("perfect_tonics").apply(pd.Series).add_prefix("perfect_tonics_"),
    ],
    axis=1,
)

df = pd.concat(
    [
        df,
        df.pop("perfect_garnishes")
        .apply(pd.Series)
        .add_prefix("perfect_garnishes_"),
    ],
    axis=1,
)

print(df.to_markdown(index=False))

Prints:

id type name first_name second_name producer country abv direct_purchase_url average_rating rating_count user_rating in_wishlist in_cabinet perfect_tonics_id perfect_tonics_type perfect_tonics_name perfect_tonics_average_rating perfect_tonics_rating_count perfect_garnishes_id perfect_garnishes_type perfect_garnishes_name
1 gin 1085 Toledo Gin Licores Caro Spain 40 6.6 74 2 tonic Macario Tonica 7.5 28 5 garnish Grapefruit
1 gin 1085 Toledo Gin Licores Caro Spain 40 6.6 74 2 tonic Macario Tonica 7.5 28 4 garnish Lime zest
1 gin 1085 Toledo Gin Licores Caro Spain 40 6.6 74 2 tonic Macario Tonica 7.5 28 3 garnish Orange zest
  • Related