Normalizing json using pandas with inconsistent nested lists/dictionaries-CodePudding

I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109

I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.

Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!

import json
import pandas as pd

data = """[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()

TypeError: 'NoneType' object is not iterable

I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.

The desired output would look like this: Dataframe

CodePudding user response：

Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:

import json

import pandas as pd

data = """\
[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""

data = json.loads(data)

df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")

df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)

Prints:

    ID Name     Desc
0  100  At1  Lazy At
1  101  NaN      NaN

CodePudding user response：

Normalize in try and except and if error is found it should append new row to DataFrame with NAN.

Example:

data = json.loads(data)
df = pd.DataFrame()
for item in data:
    try:
        df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID")) 
    except TypeError:
        df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)

print(df)

Output:

  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101